TF-IDF
07 Oct 2018
TF and IDF
Term Frequency (TF)
- TF measures how often a word occurs in a given document of the corpus. It is calculated as the number of times the word appears in the document divided by the total number of words in that document. Each word has its own TF value in each document.
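The ratio described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `term_frequency` and the sample tokens are made up for the example, and the input is assumed to be an already tokenized document.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {word: count / total number of tokens} for one tokenized document."""
    total = len(tokens)
    if total == 0:
        # an empty document has no term frequencies
        return {}
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}


# hypothetical sample document, already tokenized
doc = ["the", "cat", "sat", "on", "the", "mat"]
tf = term_frequency(doc)
print(tf["the"])  # "the" accounts for 2 of the 6 tokens
```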
Inverse Document Frequency (IDF)
- IDF measures the relative weight of a word across all documents in the corpus. In other words, it represents how rare the word is in the set of documents: the fewer documents contain the word, the higher its IDF. Each word has a single IDF value for the whole corpus.
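As a rough sketch, IDF can be computed as follows. The post does not give a formula, so this assumes one common textbook definition, log(N / df), where N is the number of documents and df is the number of documents containing the word. Libraries such as Scikit-Learn use smoothed variants, so their numbers will differ. The function name and the sample corpus are hypothetical.

```python
import math


def inverse_document_frequency(documents):
    """Return {word: log(N / df)} for a list of tokenized documents.

    Assumes the unsmoothed definition idf = log(N / df).
    """
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # set() so that a word is counted once per document
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# hypothetical sample corpus, already tokenized
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
idf = inverse_document_frequency(corpus)
print(idf["the"], idf["cat"], idf["dog"])
```

A word that appears in every document ("the") gets an IDF of 0 under this definition, while a word that appears in only one document ("dog") gets the highest value.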
TF-IDF
- TF-IDF is the product of the TF and IDF values. It is a numerical statistic that reflects how significant a word is in a particular document (TF) while also taking the other documents in the corpus into account (IDF).
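To make the multiplication concrete, here is a minimal self-contained sketch. It assumes TF is count / document length and IDF is the unsmoothed log(N / df); both are assumptions for illustration and not necessarily what a given library computes. The function name `tf_idf` and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: tf * idf} dict per tokenized document."""
    n_docs = len(documents)
    # document frequency: number of documents that contain each word
    doc_freq = Counter(word for tokens in documents for word in set(tokens))
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# hypothetical sample corpus, already tokenized
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
scores = tf_idf(corpus)
for doc_scores in scores:
    print(doc_scores)
```

In this toy corpus "the" scores 0 everywhere because it occurs in every document, whereas "dog" scores higher than "cat" because it occurs in fewer documents.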
Preprocessing
- Before you calculate the TF-IDF of all words, each document needs to be preprocessed, starting with tokenization.
- Tokenizing is the process of splitting a string into smaller units (tokens), such as words.
- For example, the text "He is a good boy" can be tokenized into: ["He", "is", "a", "good", "boy"]
- The resulting tokens can vary depending on whether stemming or lemmatization is applied, and on which method is used.
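The tokenization example above can be reproduced with a very small regex-based tokenizer. This is a simplified stand-in written for illustration (the name `simple_tokenize` is hypothetical). A real tokenizer such as `nltk.word_tokenize` handles punctuation and contractions more carefully.

```python
import re


def simple_tokenize(text):
    """Split text into runs of ASCII letters; everything else acts as a separator."""
    return re.findall(r"[A-Za-z]+", text)


print(simple_tokenize("He is a good boy"))
# ['He', 'is', 'a', 'good', 'boy']
```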
Example code
# dependencies
# note: nltk.word_tokenize, nltk.pos_tag and WordNetLemmatizer require the
# corresponding NLTK data packages to be downloaded beforehand
import os
import re

import nltk
from nltk.stem import WordNetLemmatizer

work_dir = "/Users/nowgeun/Desktop/Research/Documents/"
text_files = os.listdir(work_dir)

# create the lemmatizer once instead of once per token
lemmatizer = WordNetLemmatizer()

def pre_processing(text):
    # lowercase
    text = text.lower()
    # remove tags
    text = re.sub("</?.*?>", " <> ", text)
    # remove special characters and digits
    text = re.sub("(\\d|\\W)+", " ", text)
    return text

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = []
    # POS (part-of-speech) tagging: grammatical tagging of words by category (verb, noun, etc.)
    for word, tag in nltk.pos_tag(tokens):
        # keep only nouns (tags starting with "N")
        if tag.startswith("N"):
            lemma = lemmatizer.lemmatize(word)
            # skip single-character lemmas
            if len(lemma) != 1:
                stems.append(lemma)
    return stems

# creating corpus from text files using preprocessing function
token_dict = {}
for txt in text_files:
    if txt.endswith(".txt"):
        with open(os.path.join(work_dir, txt)) as f:
            data = f.read().replace("\n", " ")
        data = pre_processing(data)
        token_dict[txt] = data
TF-IDF computation using Scikit-Learn package
Example code
With the Scikit-Learn package, you can compute the TF-IDF values and retrieve the results in matrix form.
- Each row of the matrix represents one document
- Each column of the matrix represents one word from the vocabulary built across all documents
- The matrix is a sparse matrix (most values are 0), because most words appear in only a few of the documents
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
# sparse matrix with one row per document and one column per word; each entry is a TF-IDF value
matrix = vectorizer.fit_transform(token_dict.values())
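To see how the rows and columns of the resulting matrix map back to documents and words, here is a small self-contained sketch. It uses a made-up in-memory corpus and the vectorizer's default tokenizer instead of the text files and custom `tokenize` function above, so the file names and sentences are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical stand-in for token_dict: file name -> preprocessed text
docs = {
    "a.txt": "the cat sat on the mat",
    "b.txt": "the dog chased the cat",
    "c.txt": "birds sing in the morning",
}

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs.values())

# one row per document, one column per vocabulary word
print(matrix.shape)

# vocabulary_ maps each word to its column index in the matrix
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}

# print the two highest-scoring words of each document
for name, row in zip(docs, matrix.toarray()):
    ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)[:2]
    print(name, [(index_to_term[i], round(float(score), 3)) for i, score in ranked if score > 0])
```

Calling `toarray()` converts the sparse matrix to a dense one, which is fine for a toy corpus but can use a lot of memory on a large one.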