
Jenny's Blog


Project

Text Preprocessing

jennystar 2023. 2. 25. 21:05

02/21/2023

One of the important steps in sentiment analysis is text preprocessing. After collecting data in the form of text, we need to remove irrelevant information (stop words and punctuation) and convert the text into a standardized format, such as lowercase.
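As a rough illustration of the "standardized format" idea, here is a minimal sketch that lowercases a comment and swaps punctuation for spaces using only the standard library. The function name `normalize` and the sample sentence are made up for this example; they are not part of the project code.

```python
import string

def normalize(text: str) -> str:
    """Lowercase the text and replace ASCII punctuation with spaces."""
    # Map every ASCII punctuation character to a single space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # split() followed by join() collapses the runs of spaces left behind.
    return " ".join(text.lower().translate(table).split())

print(normalize("Loved it!!! Best. Episode. EVER;"))  # loved it best episode ever
```

Unlike the function later in this post, this sketch also drops exclamation marks.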

Stop words 

Stop words are words that are usually removed from text before it is processed for analysis or indexing. They include words like:

"the", "a", "an", "in", "is", "of", "and", "that", "this", etc.

These words are considered to add little meaning to a text, so they can usually be removed without affecting its overall meaning. This reduces the amount of text that needs to be processed.
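To make the idea concrete, here is a small sketch that filters a sentence against a hand-picked stop word set. The set below is just the example words listed above, and `remove_stop_words` is a hypothetical helper name; a real list (such as NLTK's) is much longer.

```python
# A tiny hand-picked stop word set, for illustration only.
STOP_WORDS = {"the", "a", "an", "in", "is", "of", "and", "that", "this"}

def remove_stop_words(text: str) -> str:
    """Drop every whitespace-separated word that appears in STOP_WORDS."""
    words = text.lower().split()
    kept = [word for word in words if word not in STOP_WORDS]
    return " ".join(kept)

print(remove_stop_words("This is the best episode of the season"))  # best episode season
```

Eight words go in and three come out, which is the reduction in text described above.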

 

For the preprocessing code, we used the following libraries:

# NLTK library for data preprocessing
import nltk
from nltk.corpus import stopwords
# Regular expressions
import re
import string

# Download the stop word lists (only needed the first time)
nltk.download('stopwords')

The NLTK library provides the stopwords corpus, which contains lists of stop words for various languages (English in this case) that we can use to remove stop words from text.

# Build the stop word set once: NLTK's English list plus a few informal tokens
STOPWORDS = set(stopwords.words('english')) | {'u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure'}

def text_process(text):
    # Convert the text to lowercase
    text = text.lower()
    # Remove any HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Replace any URLs with the word 'url'
    text = re.sub(r'http\S+', 'url', text)
    # Replace punctuation and symbols with spaces (letters, digits, whitespace and '!' are kept)
    text = re.sub(r'[^\w\s!]', ' ', text)
    # Remove any stopwords (numbers are kept, apart from the '4' and '2' listed above)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    # Collapse any extra whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

comment = "loved it!!!!!!!!!!!!!!!! ...;;;;;;;"
cleaned_comment = text_process(comment)
print(cleaned_comment)

This block of code converts the text to lowercase, removes HTML tags, replaces URLs with the word 'url', replaces non-word characters other than '!' with spaces, and removes the stop words in the list.

The result is:

loved it!!!!!!!!!!!!!!!!
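To see what each regular expression in text_process contributes, here is a small sketch that records the text after every step. It uses only the `re` module; the `trace` helper and the sample comment (with a made-up example.com link) are assumptions for illustration, and the stop word step is left out so the sketch runs without NLTK.

```python
import re

def trace(text):
    """Return (step name, text after that step) pairs for the regex steps."""
    steps = []
    text = text.lower()
    steps.append(("lowercase", text))
    text = re.sub(r"<[^>]*>", "", text)       # drop anything between < and >
    steps.append(("strip tags", text))
    text = re.sub(r"http\S+", "url", text)    # 'http' plus everything up to the next whitespace
    steps.append(("replace urls", text))
    text = re.sub(r"[^\w\s!]", " ", text)     # anything that is not a word char, whitespace, or '!'
    text = re.sub(r"\s+", " ", text).strip()  # collapse the spaces left behind
    steps.append(("strip symbols", text))
    return steps

for name, value in trace("<b>Loved</b> it!!! See https://example.com ...;;"):
    print(f"{name:>13}: {value!r}")

# A token with punctuation attached is a different string from the bare word,
# so a filter that compares whole tokens leaves it in place.
print("it!!!" in {"it"})  # False
```

The last line is also why "it" survives in the result above even though it is a stop word: after splitting on spaces, the token is "it!!!!!!!!!!!!!!!!", which does not match "it".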

That's it for using stop words in data preprocessing!

FYI, I still don't understand the code :) 

Ref : ChatGPT - thanks for helping us out. 
