Tags: python, csv, nlp, nltk, tokenize

How to append tokenized sentences as rows to a CSV


I am trying to sentence-tokenize several .txt files from a path, and then append each tokenized sentence as a new row of a CSV, together with the .txt document ID.

There are several .txt files in the path (work_dir). In the example below, the first column should be the file name (WLTW_5_2016_02_29) and the second column the tokenized sentence, so that if a document contains 40 sentences, I expect 40 rows with the same file name in the first column and one sentence per row in the second. I also attached a picture showing the expected CSV output.

import os
import csv
import nltk
from nltk import sent_tokenize

work_dir = '/content/drive/My Drive/deneme'
filename = 'WLTW_5_2016_02_29.txt'

# read the document
with open(os.path.join(work_dir, filename), 'rt') as file:
    text = file.read()

# split into sentences
sentences = sent_tokenize(text)
print(sentences)

with open('writeData.csv', mode='w') as file:
    writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(("filename", "sentence"))
    writer.writerow((filename, sentences))

I tried this approach but could not manage it.


With the above code, everything is written into a single row: `writerow` treats the `sentences` list as one field, so the whole list lands in the second column. As the example above shows, I instead want each sentence appended as its own row in the second column.
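A minimal, self-contained sketch of the difference, using a made-up two-sentence list in place of the tokenizer output:

```python
import csv
import io

sentences = ["First sentence.", "Second sentence."]
buf = io.StringIO()
writer = csv.writer(buf)

# Passing the whole list as one field: csv stringifies it into a single cell
writer.writerow(("doc.txt", sentences))

# Looping instead produces one row per sentence
for sentence in sentences:
    writer.writerow(("doc.txt", sentence))

print(buf.getvalue())
```

The first `writerow` call yields one row whose second column is the string form of the entire list; the loop yields one row per sentence, which is the layout the question asks for.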


Solution

  • I think my issue was with the order of the operations in the code.

    Here is the working version, in case anyone runs into the same issue — feel free to use it:

    import glob
    import csv
    from nltk import sent_tokenize

    files = glob.glob("/content/drive/My Drive/deneme/*.txt")

    # newline='' prevents blank lines between rows on Windows
    with open('writeData.csv', mode='w', newline='') as new_file:
      writer = csv.writer(new_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
      writer.writerow(("filename", "sentence"))

      for filename in files:
        # take all sentences from a given file
        with open(filename, 'rt') as file:
          text = file.read()

        sentences = sent_tokenize(text)

        # one row per sentence, paired with the file it came from
        for sentence in sentences:
          writer.writerow((filename, sentence))
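One detail: `glob.glob` returns full paths, so the first column will contain the whole path rather than the bare document ID (WLTW_5_2016_02_29) shown in the expected output. If you want just the ID, strip the directory and extension before writing — a small sketch, with `doc_id` as an illustrative variable name:

```python
import os

path = "/content/drive/My Drive/deneme/WLTW_5_2016_02_29.txt"

# Drop the directory, then the .txt extension, leaving only the document ID
doc_id = os.path.splitext(os.path.basename(path))[0]
print(doc_id)  # WLTW_5_2016_02_29
```

Inside the loop you would then write `writer.writerow((doc_id, sentence))` instead.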