제니 블로그
Text Preprocessing
02/21/2023
One of the important steps of sentiment analysis is text preprocessing. After collecting data in the form of text, we need to remove any irrelevant information (stop words, punctuation) and convert it into a standardized format, such as lowercase.
Stop words
Stop words are words that are usually removed from text before it is processed for analysis or indexing. They include words like:
"the", "a", "an", "in", "is", "of", "and", "that", "this", etc.
These words add little meaning to a text and can be removed without affecting its overall meaning. This reduces the amount of text that needs to be processed.
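The idea can be sketched with a small hardcoded list, before bringing in NLTK. The stop-word set and the sentence here are just made-up examples for illustration:

```python
# Minimal sketch of stop-word removal with a hardcoded list
stop_words = {"the", "a", "an", "in", "is", "of", "and", "that", "this"}

sentence = "this is the best anime of the season"

# Keep only the words that are not in the stop-word set
filtered = " ".join(w for w in sentence.split() if w not in stop_words)
print(filtered)  # "best anime season"
```

The filtered sentence loses its grammar but keeps the words that carry the actual meaning, which is exactly what we want for analysis.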
For the preprocessing code below, we used the following libraries:
# NLTK library for data preprocessing
import nltk
from nltk.corpus import stopwords
# regular expression
import re
import string
The NLTK library provides us with the stopwords module, which contains lists of stop words for various languages (English in this case) that we can use to remove stop words from text.
def text_process(text):
    # Convert the text to lowercase
    text = text.lower()
    # Remove any HTML tags
    text = re.sub('<[^>]*>', '', text)
    # Replace any URLs with the word 'url'
    text = re.sub(r'http\S+', 'url', text)
    # Remove any non-word characters (excluding spaces and '!')
    text = re.sub(r'[^\w\s!]', ' ', text)
    # Remove any stopwords (plus some informal txt-speak extras)
    STOPWORDS = stopwords.words('english') + ['u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure']
    text = ' '.join([word for word in text.split() if word not in STOPWORDS])
    # Collapse any extra whitespace
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
comment = "loved it!!!!!!!!!!!!!!!! ...;;;;;;;"
cleaned_comment = text_process(comment)
print(cleaned_comment)
This block of code converts the text to lowercase, removes any HTML tags, and strips out the non-word characters and stop words described above.
The result would be:
loved it!!!!!!!!!!!!!!!!
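The HTML-tag and URL regexes used above can also be tried in isolation. The sample string here is a made-up example, not data from the project:

```python
import re

text = "<b>Check</b> this out: https://example.com NOW"

# Same steps as in text_process, applied one at a time
text = text.lower()                      # lowercase everything
text = re.sub('<[^>]*>', '', text)       # strip HTML tags like <b>...</b>
text = re.sub(r'http\S+', 'url', text)   # replace the URL with the word 'url'

print(text)  # "check this out: url now"
```

Seeing each substitution on its own makes it easier to tell which pattern is responsible for which change in the final output.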

This is it for using stop words for data preprocessing!
FYI, I still don't understand the code :)
Ref : ChatGPT - thanks for helping us out.