python pandas nlp tokenize

Get each unique word in a csv file tokenized


There are two columns in a CSV table: summaries and texts. Both columns were lists before I combined them, converted the result to a data frame, and saved it as a CSV file. The texts in the table have already been cleaned (all punctuation marks removed and everything converted to lower case).

I want to loop through each cell in the table, split summaries and texts into words and tokenize each word. How can I do it?

I tried the Python CSV reader and df.apply(word_tokenize). I also tried newList=set(summaries+texts), but then I could not tokenize the result. Any solution is welcome, whether it uses the CSV file, the data frame, or the lists. Thanks in advance for your help!

Note: the real table has more than 50,000 rows.

=== Update ===

Here is the code I have tried:

import pandas as pd
data= pd.read_csv('test.csv')

data.head()

newTry=data.apply(lambda x: " ".join(x), axis=1)
type(newTry)

print (newTry)

import nltk

for sentence in newTry:
    new = sentence.split()

    print(new)
print(set(new))

[Screenshot: output of the code above]

Please refer to the output in the screenshot. There are duplicate words in the list, and some square brackets. How should I remove them? I tried set, but it gives the value for only one sentence.


Solution

  • You can use the built-in csv package to read the CSV file, and nltk to tokenize the words:

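    import csv
    
    # word_tokenize needs NLTK's Punkt tokenizer data (install it with nltk.download)
    from nltk.tokenize import word_tokenize
    
    def get_data():
        # Lazily yield each CSV row as a list of cell strings
        with open("sample_csv.csv", "r", newline="") as records:
            for record in csv.reader(records):
                yield record
    
    data = get_data()
    next(data)  # skip header
    
    words = []    # unique words, in first-seen order
    seen = set()  # the same words, for fast membership tests on large files
    
    for row in data:
        for sent in row:
            for word in word_tokenize(sent):
                if word not in seen:
                    seen.add(word)
                    words.append(word)
    print(words)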
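
Since the question says the text is already lower-cased and stripped of punctuation, a plain whitespace split may be enough as a tokenizer. The sketch below is one possible variant under that assumption, using only the standard library. The sample rows, the `SAMPLE_CSV` name and the `unique_words` helper are made up for illustration; with the real file you would pass `open("test.csv", newline="")` instead of the in-memory buffer.

```python
import csv
import io

# Hypothetical stand-in for the real CSV file (header row plus two data rows).
SAMPLE_CSV = io.StringIO(
    "summaries,texts\n"
    "good dog food,i bought this dog food and my dog loves it\n"
    "not as advertised,the product was not as advertised\n"
)


def unique_words(csv_file):
    """Return each distinct word across all cells, in first-seen order."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, if there is one
    seen = {}  # dict keys keep insertion order and give fast membership tests
    for row in reader:
        for cell in row:
            # Whitespace split; assumes the text was cleaned beforehand.
            for word in cell.split():
                seen.setdefault(word, None)
    return list(seen)


words = unique_words(SAMPLE_CSV)
print(words)
```

Collecting the words across all rows, rather than calling `set` on a single sentence, is what removes duplicates for the whole table. If punctuation or contractions are still present in the data, swapping `cell.split()` for `word_tokenize(cell)` should give the same structure with NLTK's tokenization.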