
Trying to separate my data points into multiple arrays, instead of having one big array


I'm working on an NLP project about fake news, where one of the inputs is the headlines. I have tokenized my headlines into the following format:

[['Four', 'ways', 'Bob', 'Corker', 'skewered', 'Donald', 'Trump'], ['Linklater', "'s", 'war', 'veteran', 'comedy', 'speaks', 'to', 'modern', 'America', ',', 'says', 'star'], ['Trump', '’', 's', 'Fight', 'With', 'Corker', 'Jeopardizes', 'His', 'Legislative', 'Agenda']]

Right now, each headline is its own array within a 2D array. However, when I remove the stopwords, it turns into this:

['Four', 'ways', 'Bob', 'Corker', 'skewered', 'Donald', 'Trump', 'Linklater', "'s", 'war', 'veteran', 'comedy', 'speaks', 'modern', 'America', ',', 'says', 'star', 'Trump', '’', 'Fight', 'With', 'Corker', 'Jeopardizes', 'His', 'Legislative', 'Agenda']

Each word is its own element in a 1D array. I want each headline to have its own array, as in the tokenized array. How would I go about doing this?

Here is my code:

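import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

data = pd.read_csv("/Users/amanpuranik/Desktop/fake-news-detection/data.csv")
data = data[['Headline', "Label"]]

x = np.array(data['Headline'])
y = np.array(data["Label"])

# tokenization of the data here
headline_vector = []

for headline in x:
    headline_vector.append(word_tokenize(headline))

#print(headline_vector)

# note: this rebinds the name `stopwords` from the NLTK corpus module to a set
stopwords = set(stopwords.words('english'))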

#removing stopwords at this part
filtered = []

for sentence in headline_vector:
    for word in sentence:
        if word not in stopwords:
            filtered.append(word)

Solution

  • You are iterating over every word and appending each one directly to a single list, which is why the result is flattened. Instead of appending each word, build a filtered list for each headline and append that list. This is probably clearer as a list comprehension:

    headline_vector = [['Four', 'ways', 'Bob', 'Corker', 'skewered', 'Donald', 'Trump'], ['Linklater', "'s", 'war', 'veteran', 'comedy', 'speaks', 'to', 'modern', 'America', ',', 'says', 'star'], ['Trump', '’', 's', 'Fight', 'With', 'Corker', 'Jeopardizes', 'His', 'Legislative', 'Agenda']]
    stopwords = set(["'s", "to", "His", ","])
    
    filtered = [[word for word in sentence if word not in stopwords]
                for sentence in headline_vector]
    

    Results:

    [['Four', 'ways', 'Bob', 'Corker', 'skewered', 'Donald', 'Trump'],
     ['Linklater', 'war', 'veteran', ...],
      ...etc
    ]
    

    You could get the same effect with filter():

    filtered = [list(filter(lambda word: word not in stopwords, sentence))
                for sentence in headline_vector]
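
If you would rather keep the explicit loops from the question, the same fix can be sketched as follows. This is only a minimal illustration: the two sample headlines are copied from the question, the small hand-picked stopword set stands in for NLTK's English list, and the name `kept` is made up for this example.

```python
# Sample data copied from the question; the stopword set is a hand-picked
# stand-in for NLTK's English list (an assumption for this sketch).
headline_vector = [
    ['Four', 'ways', 'Bob', 'Corker', 'skewered', 'Donald', 'Trump'],
    ['Linklater', "'s", 'war', 'veteran', 'comedy', 'speaks', 'to',
     'modern', 'America', ',', 'says', 'star'],
]
stopwords = {"'s", "to", ","}

filtered = []
for sentence in headline_vector:
    kept = []                  # start a fresh list for this headline
    for word in sentence:
        if word not in stopwords:
            kept.append(word)  # collect surviving words per headline
    filtered.append(kept)      # append the whole headline, not each word

print(filtered)
```

The only change from the original loop is the inner `kept` list: words are collected per headline, and the headline's list is appended once the inner loop finishes, so `filtered` keeps one sub-list per headline.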