
TF-IDF

TF and IDF

Term Frequency (TF)

  • TF is an index that shows how frequently a word appears in a given document of the corpus. It is calculated as the ratio of a word's count to the total number of words in that document. Each word has its own TF value
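
A common formulation (one of several variants) is:

$$ \mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} $$

where $f_{t,d}$ is the number of times the term $t$ occurs in document $d$.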

 

Inverse Document Frequency (IDF)

  • IDF is an index that shows the relative weight of a word across all documents in the corpus. In other words, it represents how rare a word is in the set of documents. Each word has its own IDF value
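
A common formulation takes the logarithm of the inverse document frequency:

$$ \mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} $$

where $N$ is the total number of documents in the corpus $D$ and the denominator counts the documents that contain the term $t$.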

   

TF-IDF

  • TF-IDF is the product of the TF and IDF values. It is a numerical statistic that aims to reflect the significance of a word in a particular document (TF) while also taking the other documents in the corpus into account (IDF).
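
Putting the two together:

$$ \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) $$

A word gets a high TF-IDF score in a document when it appears often in that document but rarely in the rest of the corpus.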

   

Preprocessing

  • Before you calculate the TF-IDF of all words, each document needs to be processed. In particular, the documents need to be tokenized
  • Tokenizing is the process of splitting a string into sections (tokens) and parsing them.
  • For example, the text “He is a good boy” can be tokenized into: [“He”,”is”,”a”,”good”,”boy”]
  • The results can vary depending on the stemming or lemmatization method used
Example code

 

# dependencies
import os
import re
import nltk

work_dir = "/Users/nowgeun/Desktop/Research/Documents/"
text_files = os.listdir(work_dir)


def pre_processing(text):
    # lowercase
    text=text.lower()
    
    # remove tags
    text=re.sub("</?.*?>"," <> ",text)
    
    # remove special characters and digits
    text=re.sub("(\\d|\\W)+"," ",text)
    
    return text
    

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = []
    lemmatizer = nltk.wordnet.WordNetLemmatizer()

    # pos (part-of-speech) tagging: grammatical tagging of words by their category (verb, noun, etc.)
    for item in nltk.pos_tag(tokens):
        # keep only nouns (tags starting with "N"), dropping single-character lemmas
        if item[1].startswith("N"):
            lemma = lemmatizer.lemmatize(item[0])
            if len(lemma) > 1:
                stems.append(lemma)

    return stems

# creating corpus from text files using the preprocessing function

token_dict = {}

for txt in text_files:
    if txt.endswith(".txt"):
        with open(work_dir + txt) as f:
            data = "".join(f.readlines()).replace("\n", " ")
            data = pre_processing(data)
            token_dict[txt] = data
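
As a quick sanity check, the two helper functions can be run on a short made-up sentence (the sentence below is only an illustration, not part of the corpus; word_tokenize, pos_tag and the lemmatizer require the NLTK punkt, averaged_perceptron_tagger and wordnet data to be downloaded):

sample = "He is a good boy, and he has 2 dogs!"
cleaned = pre_processing(sample)   # lowercased, digits and punctuation stripped
print(cleaned)                     # "he is a good boy and he has dogs "
print(tokenize(cleaned))           # only noun lemmas survive, e.g. ['boy', 'dog']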

TF-IDF computation using the Scikit-Learn package

 

Example code

 

With the Scikit-Learn package, you can compute the TF-IDF values and retrieve the results in matrix form.

  • Each row of the matrix represents one document
  • Each column of the matrix represents a word that appears somewhere in the corpus
  • The matrix is a sparse matrix (most values are 0), because most words do not appear in every document, or appear only with very low frequency

 

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=tokenize, stop_words='english')

# matrix with documents in rows and the tf-idf values of the respective words in columns
matrix = vectorizer.fit_transform(token_dict.values())
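
To see which word each column corresponds to and how a single document is scored, you can convert the sparse matrix to a DataFrame — a minimal sketch, assuming a small corpus and pandas installed (get_feature_names_out is available in recent scikit-learn versions; older ones use get_feature_names):

import pandas as pd

# words corresponding to the columns of the matrix
feature_names = vectorizer.get_feature_names_out()

# dense view: one row per document, one column per word
df = pd.DataFrame(matrix.toarray(), index=list(token_dict.keys()), columns=feature_names)

# ten highest-scoring words of the first document
print(df.iloc[0].sort_values(ascending=False).head(10))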