Here is the CSV table. It has two columns: summaries and texts. Both columns were lists before I combined them, converted them to a data frame, and saved the result as a CSV file. BTW, the texts in the table have already been cleaned (all punctuation marks removed and everything converted to lower case):
I want to loop through each cell in the table, split summaries and texts into words and tokenize each word. How can I do it?
I tried the Python CSV reader and df.apply(word_tokenize). I also tried newList=set(summaries+texts), but then I could not tokenize the result. Any solution is welcome, whether it uses the CSV file, the data frame, or the lists. Thanks in advance for your help!
note: The real table has more than 50,000 rows.
===some update==
Here is the code I have tried:
import pandas as pd
data= pd.read_csv('test.csv')
data.head()
newTry=data.apply(lambda x: " ".join(x), axis=1)
type(newTry)
print (newTry)
import nltk
for sentence in newTry:
    new=sentence.split()
    print(new)
print(set(new))
Please refer to the output in the screenshot. There are duplicate words in the list, and some square brackets. How should I remove them? I tried set, but it gives the value for only one sentence.
You can use the built-in csv package to read the CSV file, and nltk to tokenize the words:
from nltk.tokenize import word_tokenize
import csv

words = []

def get_data():
    with open("sample_csv.csv", "r") as records:
        for record in csv.reader(records):
            yield record

data = get_data()
next(data)  # skip header

for row in data:
    for sent in row:
        for word in word_tokenize(sent):
            if word not in words:
                words.append(word)

print(words)
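With more than 50,000 rows, the `word not in words` check on a list can get slow, because each lookup scans the whole list. Below is a minimal sketch of the same idea that tracks seen words in a set while keeping first-seen order. The function name `unique_tokens` and the in-memory `sample` data are hypothetical, made up for illustration. The default tokenizer is plain `str.split`, on the assumption that the text is already cleaned and lower-cased as described in the question; passing `tokenize=word_tokenize` from nltk should work the same way.

```python
import csv
import io


def unique_tokens(rows, tokenize=str.split):
    """Return the distinct tokens from an iterable of CSV rows, in first-seen order.

    `tokenize` is any callable that maps a string to a list of tokens
    (hypothetical parameter; nltk's word_tokenize could be passed here).
    """
    seen = set()    # set membership checks are O(1) on average
    ordered = []    # keeps tokens in the order they first appear
    for row in rows:
        for cell in row:
            for token in tokenize(cell):
                if token not in seen:
                    seen.add(token)
                    ordered.append(token)
    return ordered


# Hypothetical sample data standing in for the real CSV file.
sample = io.StringIO(
    "summaries,texts\n"
    "good dog food,i bought this dog food\n"
    "great taste,the taste is great\n"
)

reader = csv.reader(sample)
next(reader)  # skip the header row
vocab = unique_tokens(reader)
print(vocab)
```

To run this against the real file, replace `sample` with `open("sample_csv.csv", newline="")` inside a `with` block. If the order of the words does not matter, collecting everything into a single set is enough.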