python, csv, nlp, readline

readline() only prints half of the rows in a CSV file


As the title says, I have a CSV file with 6 columns. For NLP processing I need to extract the 6th column (the review comment column) and transform it into a list of lists of words. The code below was given by the instructor:

def read_twitter(fname):
    """ Read the given dataset into list and clean stop words. 
    
    Args: 
        fname (string): filename of Twitter Dataset
        
    Returns:
        list of list of words: we view each document as a list, including a list of all words 
    """
    twitter = []
    with open(fname,encoding="utf-8") as f:
        for line in f:
            tweet = f.readline().split(",")[5]
            
            # YOUR CLEANING CODE HERE
            #    - Clean tweet
            #    - Split into list words
            #    - Store list in twitter
            
    return twitter

Then we call the function read_twitter:

twitter = read_twitter('twitter.csv')

Once completed, it should return a list of lists as required. With no code added to the part above, I would expect it to return an empty list. Instead, it raises the following error:

IndexError Traceback (most recent call last) in

~\AppData\Local\Temp\ipykernel_15784\2512851317.py in read_twitter(fname)
     12         for line in f:
     13 
---> 14             tweet = f.readline().split(",")[5]
     15 
     16 

IndexError: list index out of range

But when I changed the code above to:

def read_twitter(fname):
    """ Read the given dataset into list and clean stop words. 
    
    Args: 
        fname (string): filename of Twitter Dataset
        
    Returns:
        list of list of words: we view each document as a list, including a list of all words 
    """
    twitter = []
    with open(fname,encoding="utf-8") as f:
        for line in f:
            print(f.readline().split(",")[5])
            
    return twitter
twitter = read_twitter('twitter.csv')

It does print results, but only for half of the rows in the dataset. I am confused about what readline() is doing here and why I keep getting the out-of-range error. Any help will be appreciated.


Solution

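  • You are skipping lines by combining file iteration with readline(). Each pass of for line in f: reads one line into line, and then tweet = f.readline().split(",")[5] reads the next one, so only every second row is processed. The IndexError most likely occurs when readline() has nothing left to read (for example, when the file has an odd number of lines): it then returns an empty string, and "".split(",") gives a one-element list with no index 5. Just remove the readline() call and use line instead.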

    def read_twitter(fname):
        """ Read the given dataset into list and clean stop words. 
        
        Args: 
            fname (string): filename of Twitter Dataset
            
        Returns:
            list of list of words: we view each document as a list, including a list of all words 
        """
        twitter = []
        with open(fname,encoding="utf-8") as f:
            for line in f:
                tweet = line.split(",")[5]
                
                # YOUR CLEANING CODE HERE
                #    - Clean tweet
                #    - Split into list words
                #    - Store list in twitter
                
        return twitter
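
    To see the skipping in isolation, here is a minimal sketch that runs both patterns on an in-memory file in Python 3. The three-row SAMPLE and the helper names sixth_column_buggy and sixth_column_fixed are made up for illustration; they are not part of the assignment, and the real twitter.csv may fail for other reasons too (for example, a row with fewer than six fields).

    ```python
    import io

    # Hypothetical three-row sample standing in for twitter.csv (six columns per row).
    SAMPLE = (
        "0,1,d1,q,user1,first tweet\n"
        "0,2,d2,q,user2,second tweet\n"
        "0,3,d3,q,user3,third tweet\n"
    )


    def sixth_column_buggy(f):
        """Original pattern: the for loop and readline() each consume a line."""
        seen = []
        for line in f:                        # consumes rows 1, 3, ...
            fields = f.readline().split(",")  # consumes rows 2, 4, ... or "" at end of file
            seen.append(fields[5].strip())    # IndexError when fields == [""]
        return seen


    def sixth_column_fixed(f):
        """Corrected pattern: use the line the for loop already read."""
        return [line.split(",")[5].strip() for line in f]


    print(sixth_column_fixed(io.StringIO(SAMPLE)))
    # ['first tweet', 'second tweet', 'third tweet']

    try:
        sixth_column_buggy(io.StringIO(SAMPLE))
    except IndexError as exc:
        # With an odd number of rows, the last readline() returns "".
        print("IndexError:", exc)
    ```

    With an even number of rows the buggy version raises nothing and silently returns only the even-numbered rows, which matches the "half of the results" symptom.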