Tags: python, nlp, tokenize

How do I tokenize text data into words and sentences without getting a TypeError?


My end goal is to use NER models to identify custom entities. Before doing this, I am tokenizing the text data into words and sentences. I have a folder of text files (.txt) that I opened and read into Jupyter using the os library. After reading the text files, whenever I try to tokenize them, I get a TypeError. Could you please advise on what I am doing wrong? My code is below. Thanks.

import os
outfile = open('result.txt', 'w')
path = "C:/Users/okeke/Documents/Work flow/IT Text analytics Project/Extract/Dubuque_text-nlp"
files = os.listdir(path)
for file in files:
    outfile.write(str(os.stat(path + "/" + file).st_size) + '\n')

outfile.close()

This code runs fine. Whenever I inspect outfile, I get the output below:

outfile
<_io.TextIOWrapper name='result.txt' mode='w' encoding='cp1252'>

Next, tokenization.

from nltk.tokenize import sent_tokenize, word_tokenize 
sent_tokens = sent_tokenize(outfile)
print(outfile)

word_tokens = word_tokenize(outfile)
print(outfile)

But then I get an error after running the code above. See the error below:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-62f66183895a> in <module>
      1 from nltk.tokenize import sent_tokenize, word_tokenize
----> 2 sent_tokens = sent_tokenize(outfile)
      3 print(outfile)
      4 
      5 #word_tokens = word_tokenize(text)

~\AppData\Local\Continuum\anaconda3\envs\nlp_course\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
     93     """
     94     tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
---> 95     return tokenizer.tokenize(text)
     96 
     97 # Standard word tokenizer.
TypeError: expected string or bytes-like object

Solution

  • (moving comment to answer)

    You are trying to process the file object instead of the text in the file. After you create the text file, re-open it and read the entire file before tokenizing.
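    To see the mismatch directly: outfile is a file object (a TextIOWrapper, as shown when you inspected it), while sent_tokenize and word_tokenize expect a plain string. A minimal illustration:

    from nltk.tokenize import sent_tokenize

    with open('result.txt', 'w') as outfile:
        print(type(outfile))  # <class '_io.TextIOWrapper'> -- a file object, not text

    # Passing the file object reproduces the error from the question:
    #   TypeError: expected string or bytes-like object
    # Passing actual text works:
    print(sent_tokenize("This is a sentence. Here is another."))
    # ['This is a sentence.', 'Here is another.']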

    Try this code:

    import os
    outfile = open('result.txt', 'w')
    path = "C:/Users/okeke/Documents/Work flow/IT Text analytics Project/Extract/Dubuque_text-nlp"
    files = os.listdir(path)
    for file in files:
        with open(path + "/" + file) as f:
            outfile.write(f.read() + '\n')  # write the file's text, not its size
            #outfile.write(str(os.stat(path + "/" + file).st_size) + '\n')

    outfile.close()  # done writing


    from nltk.tokenize import sent_tokenize, word_tokenize
    with open('result.txt') as outfile:  # re-open for reading
        alltext = outfile.read()  # read the entire file as one string
        print(alltext)

        sent_tokens = sent_tokenize(alltext)  # process the file text: tokenize sentences
        word_tokens = word_tokenize(alltext)  # process the file text: tokenize words
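
    One caveat: sent_tokenize and word_tokenize depend on NLTK's Punkt models (the load('tokenizers/punkt/...') call visible in your traceback). If they are not installed in your environment yet, a one-time download is needed, sketched below:

    import nltk

    # One-time download of the Punkt tokenizer models
    # used by sent_tokenize and word_tokenize.
    nltk.download('punkt')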