Get each unique word in a csv file tokenized

There are two columns in a CSV table: one is summaries and the other is texts. Both columns were lists before I combined them, converted them to a data frame, and saved it as a CSV file. By the way, the texts in the table have already been cleaned (all marks removed and everything converted to lowercase).

I want to loop through each cell in the table, split summaries and texts into words and tokenize each word. How can I do it?

I tried the Python CSV reader and df.apply(word_tokenize). I also tried newList = set(summaries + texts), but then I could not tokenize the result. Any solution is welcome, whether it uses the CSV file, the data frame, or the lists. Thanks in advance for your help!

Note: the real table has more than 50,000 rows.

===some update==

Here is the code I have tried:

import pandas as pd
data= pd.read_csv('test.csv')

data.head()

newTry=data.apply(lambda x: " ".join(x), axis=1)
type(newTry)

print (newTry)

import nltk

for sentence in newTry:
    new = sentence.split()
    print(new)

print(set(new))


Please refer to the output in the screenshot. There are duplicate words in the list, and some square brackets. How should I remove them? I tried with set, but it only gives the value for one sentence.

Solution

You can use the built-in csv package to read the CSV file, and nltk to tokenize the words:

from nltk.tokenize import word_tokenize
import csv

words = []

def get_data():
    with open("sample_csv.csv", "r") as records:
        for record in csv.reader(records):
            yield record

data = get_data()
next(data)  # skip header

for row in data:
    for sent in row:
        for word in word_tokenize(sent):
            if word not in words:
                words.append(word)
print(words)
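
For a table with more than 50,000 rows, the `word not in words` check on a list gets slower as the list grows. A possible variant, shown here only as a sketch, accumulates tokens across all rows in a dict, which keeps first-seen order and makes the membership test cheap. It also addresses the attempt in the question, where the set was built from a single sentence rather than from all rows. The helper name `unique_words`, its `tokenize` parameter, and the in-memory sample data are assumptions made for illustration; `str.split` is used as the default tokenizer because the text is described as already cleaned.

```python
import csv
import io


def unique_words(lines, tokenize=str.split):
    """Return the unique tokens from every cell of a CSV, in first-seen order."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    seen = {}  # dict keeps insertion order and gives fast membership tests
    for row in reader:
        for cell in row:
            for word in tokenize(cell):
                seen.setdefault(word, None)
    return list(seen)


# Hypothetical two-row sample standing in for the real file.
sample = io.StringIO(
    "summaries,texts\n"
    "good dog food,i bought this dog food and my dog loves it\n"
    "not as advertised,the product was not as advertised\n"
)
result = unique_words(sample)
print(result)
```

To run this against the real file, it should be enough to pass an open file handle, for example `unique_words(open("sample_csv.csv", newline=""), tokenize=word_tokenize)`, assuming nltk and its tokenizer data are installed.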