
TF-IDF

TF and IDF

Term Frequency (TF)

  • TF is an index that shows how frequently a word appears in a given document of the corpus. It is calculated as the ratio of a word's count to the total number of words in that document. Each word has its own TF value
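
A common formulation (one of several variants) is:

$$ \mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} $$

where $f_{t,d}$ is the number of times the term $t$ occurs in document $d$.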

 

Inverse Document Frequency (IDF)

  • IDF is an index that shows the relative weight of a word across all documents in the corpus. In other words, it represents how rare a word is in the set of documents. Each word has its own IDF value
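
A common formulation takes the logarithm of the inverse document frequency:

$$ \mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} $$

where $N$ is the total number of documents in the corpus $D$ and the denominator counts the documents that contain the term $t$.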

   

TF-IDF

  • TF-IDF is the product of the TF and IDF values. It is a numerical statistic that aims to reflect the significance of a word in a particular document (TF) while also taking the other documents in the corpus into account (IDF).
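
Putting the two together:

$$ \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) $$

A word gets a high TF-IDF score in a document when it appears often in that document but rarely in the rest of the corpus.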

   

Preprocessing

  • Before you calculate the TF-IDF of all words, each document needs to be processed. In particular, the documents need to be tokenized
  • Tokenizing is the process of splitting a string into sections (tokens) and parsing them.
  • For example, the text “He is a good boy” can be tokenized into: [“He”,”is”,”a”,”good”,”boy”]
  • The results can vary depending on the stemming or lemmatization method used
Example code

 

# dependencies
import os
import re
import nltk

work_dir = "/Users/nowgeun/Desktop/Research/Documents/"
text_files = os.listdir(work_dir)


def pre_processing(text):
    # lowercase
    text=text.lower()
    
    # remove tags
    text=re.sub("</?.*?>"," <> ",text)
    
    # remove special characters and digits
    text=re.sub("(\\d|\\W)+"," ",text)
    
    return text
    

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = []
    lemmatizer = nltk.wordnet.WordNetLemmatizer()

    # pos (part-of-speech) tagging: grammatical tagging of words by their category (verb, noun, etc.)
    for item in nltk.pos_tag(tokens):
        # keep only nouns (tags starting with "N"), dropping single-character lemmas
        if item[1].startswith("N"):
            lemma = lemmatizer.lemmatize(item[0])
            if len(lemma) > 1:
                stems.append(lemma)

    return stems

# creating corpus from text files using the preprocessing function

token_dict = {}

for txt in text_files:
    if txt.endswith(".txt"):
        with open(work_dir + txt) as f:
            data = "".join(f.readlines()).replace("\n", " ")
            data = pre_processing(data)
            token_dict[txt] = data
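
As a quick sanity check, the two helper functions can be run on a short made-up sentence (the sentence below is only an illustration, not part of the corpus; word_tokenize, pos_tag and the lemmatizer require the NLTK punkt, averaged_perceptron_tagger and wordnet data to be downloaded):

sample = "He is a good boy, and he has 2 dogs!"
cleaned = pre_processing(sample)   # lowercased, digits and punctuation stripped
print(cleaned)                     # "he is a good boy and he has dogs "
print(tokenize(cleaned))           # only noun lemmas survive, e.g. ['boy', 'dog']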

TF-IDF computation using the Scikit-Learn package

 

Example code

 

With the Scikit-Learn package, you can compute the TF-IDF values and retrieve the results in matrix form.

  • Each row of the matrix represents one document
  • Each column of the matrix represents a word that appears somewhere in the corpus
  • The matrix is a sparse matrix (most values are 0), because most words do not appear in every document, or appear only with very low frequency

 

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=tokenize, stop_words='english')

# matrix with documents in rows and the tf-idf values of the respective words in columns
matrix = vectorizer.fit_transform(token_dict.values())
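
To see which word each column corresponds to and how a single document is scored, you can convert the sparse matrix to a DataFrame — a minimal sketch, assuming a small corpus and pandas installed (get_feature_names_out is available in recent scikit-learn versions; older ones use get_feature_names):

import pandas as pd

# words corresponding to the columns of the matrix
feature_names = vectorizer.get_feature_names_out()

# dense view: one row per document, one column per word
df = pd.DataFrame(matrix.toarray(), index=list(token_dict.keys()), columns=feature_names)

# ten highest-scoring words of the first document
print(df.iloc[0].sort_values(ascending=False).head(10))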