I am trying to adapt this code (source found here) to iterate through a directory of files instead of having the input hard-coded.
#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import division, unicode_literals
import math
from textblob import TextBlob as tb


def tf(word, blob):
    return blob.words.count(word) / len(blob.words)


def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob)


def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))


def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)


document1 = tb("""Today, the weather is 30 degrees in Celcius. It is really hot""")
document2 = tb("""I can't believe the traffic headed to the beach. It is really a circus out there.'""")
document3 = tb("""There are so many tolls on this road. I recommend taking the interstate.""")

bloblist = [document1, document2, document3]
for i, blob in enumerate(bloblist):
    print("Document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words:
        score_weight = score * 100
        print("\t{}, {}".format(word, round(score_weight, 5)))
I would like to use the txt files in a directory as input, rather than each hard-coded document.
For instance, imagine I had a directory foo which contains three files: file1, file2, and file3.
File 1 contains the contents that document1 contains, i.e.
file1:
Today, the weather is 30 degrees in Celcius. It is really hot
File 2 contains the contents that document2 contains, i.e.
I can't believe the traffic headed to the beach. It is really a circus out there.
File 3 contains the contents that document3 contains, i.e.
There are so many tolls on this road. I recommend taking the interstate.
I have thought of using glob to achieve my desired result, and I have come up with the following adaptation of the code. It correctly identifies the files, but it does not process them individually, as the original code does:
file_names = glob.glob("/path/to/foo/*")
files = map(open,file_names)
documents = [file.read() for file in files]
[file.close() for file in files]
bloblist = [documents]
for i, blob in enumerate(bloblist):
    print("Document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words:
        score_weight = score * 100
        print("\t{}, {}".format(word, round(score_weight, 5)))
How can I maintain the scores for each individual file using glob?
The desired result after using the files in a directory as input would be the same as that of the original code [results truncated to the top 3 for space]:
Document 1
Celcius, 3.37888
30, 3.37888
hot, 3.37888
Document 2
there, 2.38509
out, 2.38509
headed, 2.38509
Document 3
on, 3.11896
this, 3.11896
many, 3.11896
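As a sanity check on the numbers above, here is a minimal sketch that recomputes the score for "Celcius" in document 1 without TextBlob. It assumes a plain `\w+` regex split yields the same 12 words that TextBlob produces for this sentence, and that the word occurs in only one of the three documents; the regex tokenizer is a stand-in, not what the original code uses.

```python
import math
import re

document1 = "Today, the weather is 30 degrees in Celcius. It is really hot"

# Stand-in tokenizer (assumption): split on runs of word characters.
words = re.findall(r"\w+", document1)

# Term frequency: one occurrence out of all words in the document.
tf = words.count("Celcius") / len(words)

# Inverse document frequency: 3 documents, 1 of which contains the word.
idf = math.log(3 / (1 + 1))

score = round(tf * idf * 100, 5)
print(len(words), score)  # 12 3.37888
```

The result matches the first line listed under Document 1, which is why every word that appears once, in one document only, gets the same score within that document.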
A similar question here did not fully solve the problem. How can I read in all the files to calculate the idf, but keep them separate when calculating the full tf-idf?
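In your first code example you fill bloblist with the results of tb(), whereas in your second example you fill it with the inputs for tb() (just strings). In fact, [documents] is a list with a single item, namely the whole list of strings.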
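Try replacing bloblist = [documents] with bloblist = map(tb, documents). On Python 3, use bloblist = list(map(tb, documents)) instead, because map returns a one-pass iterator there and idf calls len(bloblist).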
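To illustrate the difference, here is a small sketch that needs no TextBlob. `str.split` is only a stand-in for `tb`, and the three strings are made-up placeholders for the file contents; the behaviour shown for `map` assumes Python 3.

```python
documents = ["first file text", "second file text", "third file text"]

# Wrapping the list in another list gives ONE item, not three documents.
wrapped = [documents]
print(len(wrapped))  # 1

# On Python 3, map() returns a lazy iterator that has no length.
lazy = map(str.split, documents)
try:
    len(lazy)
    has_len = True
except TypeError:
    has_len = False

# Materialising it gives one processed item per document.
bloblist = list(map(str.split, documents))
print(has_len, len(bloblist))  # False 3
```

A list comprehension such as `[tb(doc) for doc in documents]` should behave the same way on both Python 2 and Python 3.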
You can also sort the list of file names, like this: file_names = sorted(glob.glob("/path/to/foo/*")), so that the outputs of both versions match.
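Putting the file-reading part together, a minimal sketch might look like the following. The helper name `load_texts` is hypothetical, the temporary directory merely stands in for /path/to/foo, and the `encoding` argument to `open` assumes Python 3 and UTF-8 files.

```python
import glob
import os
import tempfile


def load_texts(pattern):
    """Return (file_names, documents) for every file matching pattern."""
    # sorted() keeps the order stable, so "Document 1" is always file1.
    file_names = sorted(glob.glob(pattern))
    documents = []
    for name in file_names:
        # The with-block closes each file as soon as it has been read.
        with open(name, encoding="utf-8") as handle:
            documents.append(handle.read())
    return file_names, documents


# Demo on a throwaway directory; the files are created out of order on purpose.
with tempfile.TemporaryDirectory() as foo:
    for name, text in [("file2", "second"), ("file1", "first"), ("file3", "third")]:
        with open(os.path.join(foo, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    file_names, documents = load_texts(os.path.join(foo, "*"))

print([os.path.basename(name) for name in file_names])
print(documents)
```

With TextBlob installed, `bloblist = [tb(text) for text in documents]` would then give one blob per file, and the scoring loop from the question can stay unchanged.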