Splitting sentences from a .txt file to .csv using NLTK


I have a corpus of newspaper articles in a .txt file, and I'm trying to split it into sentences and write them to a .csv file so that I can annotate each sentence.

I was told to use NLTK for this purpose, and I found the following code for sentence splitting:

import nltk

from nltk.tokenize import sent_tokenize

sent_tokenize("Here is my first sentence. And that's a second one.")

However, I'm wondering:

  1. How do I use a .txt file as input for the tokenizer (so that I don't have to copy and paste everything)?
  2. How do I output a .csv file instead of just printing the sentences in my terminal?

Solution

  • Reading a .txt file & tokenizing its sentences

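    Assuming the .txt file is in the same folder as your Python script, you can read it and split it into sentences with NLTK as shown below. sent_tokenize relies on NLTK's Punkt model, so if it raises a LookupError, download the model once with nltk.download("punkt") (recent NLTK releases ask for "punkt_tab" instead):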

    from nltk import sent_tokenize
    
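    # Read the whole file into one string (UTF-8 is assumed; adjust if needed).
    with open("myfile.txt", encoding="utf-8") as file:
        textFile = file.read()
    
    # Split the text into a list of sentence strings.
    tokenTextList = sent_tokenize(textFile)
    print(tokenTextList)
    # Example output: ['Here is my first sentence.', "And that's a second one."]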
    

  • Writing a list of sentence tokens to a .csv file

    There are several options for writing a .csv file. Pick whichever is most convenient (e.g. if you already have pandas loaded, use the pandas option).

    To write a .csv file using the pandas module:

    import pandas as pd
    
    df = pd.DataFrame(tokenTextList)
    df.to_csv("myCSVfile.csv", index=False, header=False)
    

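    To write a .csv file using the numpy module (np.savetxt does not quote fields, so this only round-trips cleanly if no sentence contains a comma):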

    import numpy as np
    
    np.savetxt("myCSVfile.csv", tokenTextList, delimiter=",", fmt="%s")
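
The effect of quoting can be checked with the standard library alone. In this sketch, which is an illustration and not a call into numpy, the same comma-containing sentence is written once as-is and once through csv.writer, then parsed back with csv.reader. The sample sentence and the variable names are made up for the example.

```python
import csv
import io

sentences = ["Well, that's a second one."]

# Unquoted output: the sentence is written as-is, comma included.
unquoted = io.StringIO("\n".join(sentences) + "\n")
rows_unquoted = list(csv.reader(unquoted))

# csv.writer quotes any field that contains the delimiter.
quoted = io.StringIO()
csv.writer(quoted).writerows([s] for s in sentences)
quoted.seek(0)
rows_quoted = list(csv.reader(quoted))

print(rows_unquoted)  # the sentence is split into two columns at the comma
print(rows_quoted)    # the sentence comes back as a single column
```

If your corpus contains commas inside sentences, which newspaper text almost certainly does, the pandas or csv options are the safer choice.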
    

    To write a .csv file using the csv module:

    import csv
    
    with open('myCSVfile.csv', 'w', newline='', encoding='utf-8') as file:
        write = csv.writer(file, lineterminator='\n')
        # One sentence per row, matching the pandas output above:
        write.writerows([[token] for token in tokenTextList])
        # To put all sentences in a single row instead:
        # write.writerows([tokenTextList])
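
Putting the two steps together, a minimal end-to-end sketch might look like the following. The function names, the header row with an empty "annotation" column, the UTF-8 encoding, and the collapsing of line breaks inside a sentence are all assumptions made for illustration; adjust them to fit your annotation workflow.

```python
import csv
from pathlib import Path


def write_sentences_csv(sentences, csv_path):
    """Write one sentence per row, plus an empty 'annotation' column."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["sentence", "annotation"])  # hypothetical header names
        for sentence in sentences:
            # Collapse line breaks and repeated spaces inside a sentence.
            writer.writerow([" ".join(sentence.split()), ""])


def txt_to_csv(txt_path, csv_path):
    """Read a .txt file, split it into sentences with NLTK, write a .csv."""
    # Imported here so the CSV helper above works without NLTK installed.
    from nltk.tokenize import sent_tokenize

    text = Path(txt_path).read_text(encoding="utf-8")
    write_sentences_csv(sent_tokenize(text), csv_path)
```

With NLTK and its Punkt model installed, calling txt_to_csv("myfile.txt", "myCSVfile.csv") would produce a file you can open in a spreadsheet and annotate in the second column.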