Tags: python, glob, textblob

Using directory as input for tf-idf with python `textblob`


I am trying to adapt this code (source found here) to iterate through a directory of files, instead of having the input hard-coded.

#!/usr/bin/python
# -*- coding: utf-8 -*-

from __future__ import division, unicode_literals
import math
from textblob import TextBlob as tb

def tf(word, blob):
    # Term frequency: how often `word` occurs in this blob, normalized by its length.
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    # Number of documents in the corpus that contain `word`.
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    # Inverse document frequency, smoothed with +1 to avoid division by zero.
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    # tf-idf of `word` in this blob relative to the whole corpus.
    return tf(word, blob) * idf(word, bloblist)


document1 = tb("""Today, the weather is 30 degrees in Celcius. It is really hot""")

document2 = tb("""I can't believe the traffic headed to the beach. It is really a circus out there.'""")

document3 = tb("""There are so many tolls on this road. I recommend taking the interstate.""")

bloblist = [document1, document2, document3]
for i, blob in enumerate(bloblist):
    print("Document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words:
        score_weight = score * 100 
        print("\t{}, {}".format(word, round(score_weight, 5)))

I would like to use the text files in a directory as input, rather than each hard-coded document.

For instance, imagine I had a directory foo containing three files: file1, file2, and file3.

file1 contains the same text as document1:

Today, the weather is 30 degrees in Celcius. It is really hot

file2 contains the same text as document2:

I can't believe the traffic headed to the beach. It is really a circus out there.

file3 contains the same text as document3:

There are so many tolls on this road. I recommend taking the interstate.

I have thought to use glob to achieve my desired result, and I have come up with the following code adaptation, which correctly identifies the files but does not process them individually, as the original code does:

file_names = glob.glob("/path/to/foo/*")
files = map(open, file_names)
documents = [file.read() for file in files]
[file.close() for file in files]


bloblist = [documents]
for i, blob in enumerate(bloblist):
    print("Document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words:
        score_weight = score * 100 
        print("\t{}, {}".format(word, round(score_weight, 5)))

How can I maintain the scores for each individual file using glob?

The desired result after using the files in a directory as input would be the same as with the original code [results truncated to top 3 for space]:

Document 1
    Celcius, 3.37888
    30, 3.37888
    hot, 3.37888
Document 2
    there, 2.38509
    out, 2.38509
    headed, 2.38509
Document 3
    on, 3.11896
    this, 3.11896
    many, 3.11896
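
For reference, the 3.37888 shown for Celcius above can be reproduced by hand (assuming TextBlob splits document1 into 12 words):

from __future__ import division
import math

tf_celcius = 1 / 12                  # "Celcius" occurs once among document1's 12 words
idf_celcius = math.log(3 / (1 + 1))  # it appears in 1 of the 3 documents
print(round(tf_celcius * idf_celcius * 100, 5))  # -> 3.37888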

A similar question here did not fully solve the problem. How can I read in the files to calculate the idf, while keeping them separate so I can calculate the full tf-idf for each?


Solution

  • In your first code example you fill bloblist with the results of tb(), but in your second example you fill it with the inputs for tb() (plain strings).

    Try replacing bloblist = [documents] with bloblist = map(tb, documents) (on Python 3, wrap the map() call in list(), since bloblist is iterated more than once).

    You can also sort the filename list, e.g. file_names = sorted(glob.glob("/path/to/foo/*")), so that the output order of both versions matches. A minimal sketch of the adapted script follows below.
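
    For completeness, here is a minimal sketch of the adapted script built on that suggestion. It assumes Python 3 (hence the list comprehension instead of a bare map(), since bloblist is iterated more than once) and keeps /path/to/foo as a placeholder path; the scoring loop is copied unchanged from the question.

    import glob
    import math
    from textblob import TextBlob as tb

    def tf(word, blob):
        return blob.words.count(word) / len(blob.words)

    def n_containing(word, bloblist):
        return sum(1 for blob in bloblist if word in blob)

    def idf(word, bloblist):
        return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

    def tfidf(word, blob, bloblist):
        return tf(word, blob) * idf(word, bloblist)

    # Sort the file names so the document numbering matches the hard-coded version.
    file_names = sorted(glob.glob("/path/to/foo/*"))

    documents = []
    for file_name in file_names:
        with open(file_name) as f:   # the with-block closes each file automatically
            documents.append(f.read())

    # Each document string becomes a TextBlob, mirroring the original bloblist.
    bloblist = [tb(document) for document in documents]

    for i, blob in enumerate(bloblist):
        print("Document {}".format(i + 1))
        scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        for word, score in sorted_words:
            score_weight = score * 100
            print("\t{}, {}".format(word, round(score_weight, 5)))

    The only changes from the question's attempt are sorting the file names, reading each file inside a with block, and building bloblist as a list of TextBlob objects rather than a one-element list of raw strings.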