As titled, I have a CSV file with 6 columns. For NLP processing I need to extract the 6th column (which is a review comment column) and transform it into a list of lists of words using NLP. The code below was given by the instructor:
```python
def read_twitter(fname):
    """ Read the given dataset into list and clean stop words.
    Args:
        fname (string): filename of Twitter Dataset
    Returns:
        list of list of words: we view each document as a list, including a list of all words
    """
    twitter = []
    with open(fname, encoding="utf-8") as f:
        for line in f:
            tweet = f.readline().split(",")[5]
            # YOUR CLEANING CODE HERE
            # - Clean tweet
            # - Split into list words
            # - Store list in twitter
    return twitter
```
Then we call the function read_twitter:

```python
twitter = read_twitter('twitter.csv')
```

It should return a list of lists as required. However, with no code added to the skeleton above, I'm sure it should return an empty list. Instead it gives the following error:
```
IndexError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_15784\2512851317.py in read_twitter(fname)
     12         for line in f:
     13
---> 14             tweet = f.readline().split(",")[5]
     15
     16
IndexError: list index out of range
```
But when I tried to edit the above code and change it to:

```python
def read_twitter(fname):
    """ Read the given dataset into list and clean stop words.
    Args:
        fname (string): filename of Twitter Dataset
    Returns:
        list of list of words: we view each document as a list, including a list of all words
    """
    twitter = []
    with open(fname, encoding="utf-8") as f:
        for line in f:
            print(f.readline().split(",")[5])
    return twitter

twitter = read_twitter('twitter.csv')
```
it actually prints output, but only for half the rows of the dataset. I am quite confused about what this readline() function is doing here and why it keeps saying out of range. Any help will be appreciated.
You are skipping lines by combining file iteration with readline(). `for line in f:` consumes one line, then `tweet = f.readline().split(",")[5]` consumes the next, so each pass through the loop eats two lines. That is why you only ever see half the rows. It also explains the IndexError: when readline() is called at end of file it returns `''`, and `''.split(",")` is `['']`, which has no index 5. Just remove the readline() and use the loop variable instead.
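To see the skipping concretely, here is a minimal sketch using `io.StringIO` to stand in for your file (the field values are made up):

```python
import io

# A toy "file" with four comma-separated rows of six fields each.
data = "\n".join(f"a,b,c,d,e,row{i}" for i in range(4)) + "\n"
f = io.StringIO(data)

seen = []
for line in f:          # the for loop consumes one line...
    nxt = f.readline()  # ...and readline() immediately consumes the next
    seen.append((line.strip(), nxt.strip()))

print(seen)
# Only rows 0 and 2 ever reach the loop variable `line`;
# rows 1 and 3 are swallowed by readline().

# With an odd number of lines, the last readline() hits end of file:
f2 = io.StringIO("a,b,c,d,e,only\n")
for line in f2:
    leftover = f2.readline()   # '' at EOF
    print(leftover.split(","))  # [''] -- indexing [5] raises IndexError
```

With that in mind, the fix is simply to split `line` itself, as in the corrected function below.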
```python
def read_twitter(fname):
    """ Read the given dataset into list and clean stop words.
    Args:
        fname (string): filename of Twitter Dataset
    Returns:
        list of list of words: we view each document as a list, including a list of all words
    """
    twitter = []
    with open(fname, encoding="utf-8") as f:
        for line in f:
            tweet = line.split(",")[5]
            # YOUR CLEANING CODE HERE
            # - Clean tweet
            # - Split into list words
            # - Store list in twitter
    return twitter
```
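One caveat: tweets often contain commas inside quoted fields, and a naive `line.split(",")` will then grab the wrong column. If your dataset quotes such fields, the stdlib `csv` module handles this for you. A sketch (the `.lower().split()` cleaning is just a placeholder, not your assignment's cleaning step):

```python
import csv

def read_twitter(fname):
    """Read the dataset, taking the 6th field of each row via csv.reader."""
    twitter = []
    with open(fname, encoding="utf-8", newline="") as f:
        for row in csv.reader(f):   # correctly parses commas inside quotes
            if len(row) > 5:        # skip malformed or short rows
                tweet = row[5]
                twitter.append(tweet.lower().split())  # placeholder cleaning
    return twitter
```

Note the `newline=""` argument, which the `csv` docs recommend so that embedded newlines inside quoted fields are handled correctly.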