This is my first post on StackOverflow, so please forgive any faux pas I may be making. I'm also new to Python, so any and all tips are welcome. My question is simple, but no matter what I've tried I can't seem to figure it out. Here is my code:
import os
from bs4 import BeautifulSoup
import string
import nltk
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import FreqDist
# For TF-IDF calculations
import math
from textblob import TextBlob as tb
def tf(word, blob):
    return blob.words.count(word) / len(blob.words)
def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)
def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))
def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)
rootDir = r'D:\rootDir'  # raw string, so \r is not read as a carriage return
testPath = r'D:\testPath'
trainPath = r'D:\trainPath'
data = []
lemmatizer = WordNetLemmatizer()
stop = stopwords.words("english")
stop += ['also']
bloblist_train = [tb('')] #Before EDIT: bloblist_train = tb('')
bloblist_test = [tb('')] #Before EDIT: bloblist_test = tb('')
for currentDirPath, subDirs, files in os.walk(rootDir):
    for file in files:
        with open(os.path.join(currentDirPath, file)) as dataFile:
            inFile = dataFile.read()
            html = BeautifulSoup(inFile, "html.parser")
            text = html.get_text()
            text_no_punc = text.translate(str.maketrans("", "", string.punctuation))
            if testPath in currentDirPath:
                bloblist_test += (tb(text_no_punc))
            elif trainPath in currentDirPath:
                bloblist_train += tb(text_no_punc)
            words = text_no_punc.split()
            data = data + words
I'm iterating over a large directory of HTML documents, parsing each one, and then trying to compute the TF-IDF for each word. I'm using a mix of packages for this, including BeautifulSoup, NLTK, and TextBlob. I'm using TextBlob to compute the TF-IDF, but I've run into trouble creating a list of TextBlobs. The specific lines I'm having issues with are these:
if testPath in currentDirPath:
    bloblist_test += tb(text_no_punc)
elif trainPath in currentDirPath:
    bloblist_train += tb(text_no_punc)
The code presently creates just one giant TextBlob with all the documents concatenated together. I would like one TextBlob per document. I have also tried the following approach:
if testPath in currentDirPath:
    bloblist_test.append(tb(text_no_punc))
elif trainPath in currentDirPath:
    bloblist_train.append(tb(text_no_punc))
which gives the error:
AttributeError: 'TextBlob' object has no attribute 'append'
What am I missing? append is the method I have been using to build up lists in Python, like so:
s1 = [1,2,3]
s2 = [4,5]
s1.append(s2)
# s1 is now [1, 2, 3, [4, 5]]
But TextBlobs apparently don't support this.
So how do I go about creating a list of these TextBlobs?
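For anyone trying to reproduce the two symptoms above without installing TextBlob, here is a minimal sketch that uses a plain str as a stand-in. That substitution is an assumption made purely for illustration: like the TextBlob behaviour described above, a str can be concatenated with + but has no append method. The names docs, blob, and error_message are hypothetical.

```python
# Plain str used as a stand-in for TextBlob (an assumption for illustration):
# it supports "+" concatenation but, like TextBlob, has no append() method.
docs = ["one two three", "four five"]

blob = ""                 # a single string-like object, NOT a list
for doc in docs:
    blob += doc           # concatenates everything into one long object
print(repr(blob))         # 'one two threefour five'

error_message = ""
try:
    blob.append("six")    # only lists have append()
except AttributeError as err:
    error_message = str(err)
print(error_message)      # 'str' object has no attribute 'append'
```

Both symptoms come from the same cause: the accumulator starts life as a single text object rather than as a list.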
EDIT:
So I made some progress on my own, but am still having trouble formatting the list. Instead of initializing bloblist_train and bloblist_test to tb(''), I set them equal to [tb('')] because, like my question says, they're supposed to hold a LIST of TextBlobs, not just TextBlobs. So now it would seem... it works! There's just one thing I still can't seem to get right: the way it is now creates a list with one empty TextBlob as the very first item (e.g. [TextBlob(""), TextBlob("one two three")]).
I realize this is a slightly different question than what I started with, so if someone thinks I need to close this question and start a separate one, please let me know. Again, I'm new.
If not, I feel there is a simple keyword or syntactical solution that I'm missing and would greatly appreciate some input.
I eventually discovered the answer myself. I suspected the answer was trivial, and it turns out it was, but because I was working with TextBlob, I thought that would change the nature of the answer. Well, it didn't. I'm disappointed in the seasoned Pythoners who passed my question by without a thought, as all I had to do was this:
bloblist_train = []
bloblist_test = []
for currentDirPath, subDirs, files in os.walk(rootDir):
    for file in files:
        with open(os.path.join(currentDirPath, file)) as dataFile:
            # .
            # .
            # .
            if testPath in currentDirPath:
                bloblist_test += [tb(text_no_punc)]
            elif trainPath in currentDirPath:
                bloblist_train += [tb(text_no_punc)]
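The accumulation pattern in the loop above can be checked with nothing but the standard library. This is only a sketch: plain strings stand in for the TextBlob objects, and the names docs, appended, extended, and split_up are made up for the example. It suggests that += [item] and append(item) give the same result, and shows what happens when the brackets are left off.

```python
# Plain strings stand in for TextBlob objects (an assumption for illustration).
docs = ["one two three", "four five"]

appended = []
for doc in docs:
    appended.append(doc)   # adds the object itself as one new item

extended = []
for doc in docs:
    extended += [doc]      # extends the list by a one-item list: same result

print(appended == extended)  # True

# Without the brackets, += iterates over the right-hand side,
# so a string is broken up into its individual characters.
split_up = []
split_up += "ab"
print(split_up)              # ['a', 'b']
```

Either form works once the accumulator starts as an empty list; append is arguably the more common spelling.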
I was mixed up in the land of other languages, like C++, where I have to initialize my variables to their expected type. Instead, thanks to wonderful Python, I just had to create an empty list and then add things to it. To other beginners like me: remember, Python doesn't care what is in a list. It can be a mix of any objects you put in there: int, float, string, TextBlob, etc. Just tell it you have a list and then add away.
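To see why one blob per document matters, here is a small sketch that runs the tf, n_containing, idf, and tfidf functions from the question over a list of documents. SimpleBlob is a hypothetical stand-in I made up so the example runs without TextBlob; it only provides the .words attribute those functions rely on, and the three sample documents are invented.

```python
import math


class SimpleBlob:
    """Hypothetical stand-in for TextBlob: only provides a .words list."""

    def __init__(self, text):
        self.words = text.split()


def tf(word, blob):
    return blob.words.count(word) / len(blob.words)


def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)


def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))


def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)


# One blob per document, collected in an ordinary list.
bloblist = []
for text in ["apple banana apple", "banana cherry", "cherry cherry date"]:
    bloblist.append(SimpleBlob(text))

# Score every distinct word of the first document against the whole list.
first = bloblist[0]
scores = {word: tfidf(word, first, bloblist) for word in set(first.words)}
for word, score in sorted(scores.items()):
    print(word, round(score, 3))
```

Because idf divides by the number of documents that contain a word, it only produces a useful weighting when bloblist holds each document separately. With everything concatenated into one blob there is nothing to compare against.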