Tags: python, python-3.x, list, syntax, textblob

How to create a list of TextBlobs?


This is my first post on StackOverflow, so please forgive any faux pas I may be making. I'm also new to Python, so any and all tips are welcome. My question is simple, but no matter what I've tried I can't seem to figure it out. Here is my code:

import os
from bs4 import BeautifulSoup
import string
import nltk
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import FreqDist

# For TF-IDF calculations
import math
from textblob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

rootDir = r'D:\rootDir'  # raw string, so \r is not read as a carriage return
testPath = r'D:\testPath'
trainPath = r'D:\trainPath'

data = []
lemmatizer = WordNetLemmatizer()
stop = stopwords.words("english")
stop += ['also']

bloblist_train = [tb('')]  #Before EDIT: bloblist_train = tb('')
bloblist_test = [tb('')]   #Before EDIT: bloblist_test = tb('')

for currentDirPath, subDirs, files in os.walk(rootDir):
    for file in files:
        with open(os.path.join(currentDirPath, file)) as dataFile:
            inFile = dataFile.read()
            html = BeautifulSoup(inFile, "html.parser")
            text = html.get_text()
            text_no_punc = text.translate(str.maketrans("", "", string.punctuation))
            if testPath in currentDirPath:
                bloblist_test += (tb(text_no_punc))
            elif trainPath in currentDirPath:
                bloblist_train += tb(text_no_punc)
            words = text_no_punc.split()
            data = data + words

I'm iterating over a large directory of HTML documents, parsing each one, and then trying to compute the TF-IDF for each word. I'm using a mix of packages for this, including BeautifulSoup, NLTK, and TextBlob. I'm using TextBlob to compute the TF-IDF, but I've run into trouble creating a list of TextBlobs. The specific lines I'm having issues with are these:

if testPath in currentDirPath:
    bloblist_test += tb(text_no_punc)
elif trainPath in currentDirPath:
    bloblist_train += tb(text_no_punc)

The code currently creates one giant TextBlob with all the documents concatenated. I would like one TextBlob per document. I have also tried the following approach

if testPath in currentDirPath:
    bloblist_test.append(tb(text_no_punc))
elif trainPath in currentDirPath:
    bloblist_train.append(tb(text_no_punc))

which gives the error:

AttributeError: 'TextBlob' object has no attribute 'append'

What am I missing? Append is the method I have been using to build lists in Python, like so:

s1 = [1,2,3]
s2 = [4,5]
s1.append(s2)
# s1 is now [1, 2, 3, [4, 5]]

But TextBlobs apparently don't support this.
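The error message is the clue here: append is a list method, so it only exists when the name it is called on is bound to a list. A minimal sketch of the difference, using a plain str as a stand-in for a TextBlob. That stand-in is an assumption made so the snippet runs without TextBlob installed; it relies only on the behaviour reported in the question, namely that += on a TextBlob concatenates, as it does for str.

```python
# Accumulator bound to a non-list object (str stands in for tb('')).
merged = ""
merged += "doc one "
merged += "doc two"
# merged is now one concatenated value, not a collection of documents.

# A str has no append method, which mirrors the AttributeError above.
has_append = hasattr(merged, "append")

# Accumulator bound to an empty list: one entry per document.
docs = []
docs.append("doc one")
docs += ["doc two"]  # += with a one-item list has the same effect as append

# On lists, append nests its argument, while extend splices the items in.
nested = [1, 2, 3]
nested.append([4, 5])

flat = [1, 2, 3]
flat.extend([4, 5])
```

Whether append or += is available, and what it does, is decided by the type of the accumulator, not by the type of the thing being added.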

So how do I go about creating a list of these TextBlobs?

EDIT:

So I made some progress on my own, but I'm still having trouble with the list. Instead of initializing bloblist_train and bloblist_test to tb(''), I set them to [tb('')] because, as my question says, they're supposed to hold a LIST of TextBlobs, not a single TextBlob. So now it would seem... it works! There's just one thing I still can't get right: this creates a list with one empty TextBlob as the very first item (e.g. [TextBlob(""), TextBlob("one two three")]).

I realize this is a slightly different question than what I started with, so if someone thinks I need to close this question and start a separate one, please let me know. Again, I'm new.

If not, I feel there is a simple keyword or syntactical solution that I'm missing and would greatly appreciate some input.


Solution

  • I eventually found the answer myself. It seemed trivial, and it turned out to be, but because I was working with TextBlob I assumed the answer would be different. It wasn't. I'm disappointed in the seasoned Pythonistas who passed my question by without a thought, since all I had to do was this:

    # Start from empty lists instead of a placeholder TextBlob.
    bloblist_train = []
    bloblist_test = []

    for currentDirPath, subDirs, files in os.walk(rootDir):
        for file in files:
            with open(os.path.join(currentDirPath, file)) as dataFile:
                # .
                # .
                # .
                if testPath in currentDirPath:
                    bloblist_test += [tb(text_no_punc)]
                elif trainPath in currentDirPath:
                    bloblist_train += [tb(text_no_punc)]
    

    I was mixed up in the land of other languages, like C++, where I have to declare my variables with their expected type. Instead, thanks to wonderful Python, I just had to create an empty list and then add things to it. To other beginners like me: remember, Python doesn't care what is in a list. It can hold a mix of any objects you put in there: int, str, TextBlob, etc. Just tell it you have a list and then add away.
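With one entry per document, the tf/idf helpers from the question can score each document on its own. The following is only a sketch under stated assumptions: plain lists of words stand in for blob.words so it runs without TextBlob installed, and the three sample documents and the top_words name are made up for illustration.

```python
import math

# Same formulas as in the question, but taking a list of words
# (stand-in for blob.words) instead of a TextBlob.
def tf(word, words):
    return words.count(word) / len(words)

def n_containing(word, docs):
    return sum(1 for words in docs if word in words)

def idf(word, docs):
    return math.log(len(docs) / (1 + n_containing(word, docs)))

def tfidf(word, words, docs):
    return tf(word, words) * idf(word, docs)

# Hypothetical sample documents; start from an empty list and
# append one entry per document.
docs = []
for text in ["one two three", "two three four four", "five five six"]:
    docs.append(text.split())

# Highest-scoring word in each document.
top_words = []
for words in docs:
    scores = {word: tfidf(word, words, docs) for word in sorted(set(words))}
    top_words.append(max(scores, key=scores.get))
```

If everything were concatenated into a single document, len(docs) would be 1 and idf could no longer tell common words from rare ones, which is why the list of separate documents matters for this calculation.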