Tags: python, csv, nltk, tokenize, data-cleaning

How to remove Stopwords from CSV file using NLTK?


I am trying to remove stopwords from a CSV file that has 3 columns and write the result to a new CSV file. The stopwords are removed successfully; however, the data in the new file appears in a single row across the top rather than in the columns of the original file.

    import io 
    import codecs
    import csv
    from nltk.corpus import stopwords 
    from nltk.tokenize import word_tokenize 

    stop_words = set(stopwords.words('english')) 
    file1 = codecs.open('soccer.csv','r','utf-8') 
    line = file1.read() 
    words = line.split()
    for r in words: 
        if not r in stop_words: 
            appendFile = open('stopwords_soccer.csv','a', encoding='utf-8') 
            appendFile.write(" "+r)
            appendFile.close()

Solution

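  • Everything lands on one row because `line.split()` splits on all whitespace, including the line breaks, and the loop never writes a newline back. Write a newline character after each word you keep: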

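    # Open the output file once instead of reopening it for every word.
    with open('stopwords_soccer.csv', 'a', encoding='utf-8') as appendFile:
        for r in words:
            if r not in stop_words:
                # End each kept word with a newline so it starts a new row.
                appendFile.write(r + "\n")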
    

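    This puts each remaining word on its own line instead of running them all together on the top row. Note that the original three-column layout is still lost, because `line.split()` discards the row and column boundaries.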
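
If the goal is to keep the original columns, one possible approach is to read the file row by row with the standard `csv` module and filter the stopwords inside each cell. The following is a minimal sketch, not a drop-in fix: the helper names `strip_stopwords` and `remove_stopwords_from_csv` are made up for illustration, and it assumes that words are separated by whitespace and that a case-insensitive match against the stopword set is what you want.

```python
import csv


def strip_stopwords(text, stop_words):
    """Return text without the whitespace-separated words found in stop_words."""
    # Hypothetical helper: lowercases each word before the lookup.
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


def remove_stopwords_from_csv(src_path, dst_path, stop_words):
    """Copy a CSV file row by row, filtering stopwords inside each cell."""
    # newline="" lets the csv module handle line endings itself.
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # One output cell per input cell, so the column count is unchanged.
            writer.writerow([strip_stopwords(cell, stop_words) for cell in row])
```

With the setup from the question, this could be called as `remove_stopwords_from_csv('soccer.csv', 'stopwords_soccer.csv', stop_words)`, where `stop_words` is the set built from `stopwords.words('english')`. Two caveats: a header row is filtered like any other row, and a word with punctuation attached (for example `the,`) will not match its stopword; tokenizing each cell with `word_tokenize` instead of `split()` is one way to handle the latter.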