
How to tokenize a big text into sentences and words


I'm working with NLTK on Portuguese-language text.

This is my code:

import nltk
import numpy as np
from nltk.corpus import machado, mac_morpho, floresta, genesis

from nltk.text import Text
ptext1 = Text(machado.words('romance/marm05.txt'), name="Memórias Póstumas de Brás Cubas (1881)")
ptext2 = Text(machado.words('romance/marm08.txt'), name="Dom Casmurro (1899)")
ptext3 = Text(genesis.words('portuguese.txt'), name="Gênesis")
ptext4 = Text(mac_morpho.words('mu94se01.txt'), name="Folha de Sao Paulo (1994)")

For example, I want to split ptext4 into sentences and then split those into words:

sentencas = nltk.sent_tokenize(ptext4)
palavras = nltk.word_tokenize(ptext4)

But it doesn't work. The error is: TypeError: expected string or bytes-like object

I tried this:

sentencas = [row for row in nltk.sent_tokenize(row)]

But the result isn't what I expected:

[In]  sentencas
[Out] ['Fujimori']

What can I do? I'm new to this.


Solution

  • If you just want the list of words from the machado corpus, use the .words() function.

    >>> from nltk.corpus import machado
    >>> machado.words()
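
    The error in the question comes from passing an nltk.text.Text object (a sequence of already-tokenized words) to sent_tokenize, which expects a raw string. You get the same TypeError when handing a list to any regex-based string function; a minimal stdlib-only illustration (the token list is made up for the demo):

    ```python
    import re

    # An nltk.text.Text essentially holds a list of token strings like this
    tokens = ["Fujimori", "venceu", "."]

    # Regex-based tokenizers operate on strings; a list of tokens raises
    # the same TypeError reported in the question.
    try:
        re.findall(r"\w+", tokens)
    except TypeError as e:
        print(e)
    ```

    So the fix is to start from the raw string, as shown next.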
    

    But if you want to process raw text, e.g.

    >>> text = machado.raw('romance/marm08.txt')
    >>> print(text)
    

    then use this idiom:

    >>> from nltk import word_tokenize, sent_tokenize
    >>> text = machado.raw('romance/marm08.txt')
    >>> tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]
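
    Since these texts are in Portuguese, you can also pass the language argument so the Punkt model trained on Portuguese is used. A sketch, assuming the punkt data has been downloaded and using a made-up sample sentence:

    ```python
    import nltk
    from nltk import sent_tokenize, word_tokenize

    # Punkt sentence models; newer NLTK releases look up punkt_tab instead
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)

    texto = "O senhor Fujimori venceu a eleição. Depois, ele viajou para Lima."
    sentencas = sent_tokenize(texto, language='portuguese')
    palavras = [word_tokenize(s, language='portuguese') for s in sentencas]
    print(sentencas)
    print(palavras)
    ```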
    

    And to iterate through the tokenized_text, which is a list(list(str)), do this:

    >>> for sent in tokenized_text:
    ...     for word in sent:
    ...         print(word)
    ...     break
    ...
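
    If you also need a single flat list of words again (e.g. to build a frequency distribution), you can flatten the nested list. The sample data below is illustrative, standing in for the real tokenized_text:

    ```python
    from itertools import chain

    # tokenized_text is a list of sentences, each a list of word strings
    tokenized_text = [["Fujimori", "venceu", "."], ["Ele", "viajou", "."]]

    # chain.from_iterable joins the inner lists into one flat sequence
    palavras = list(chain.from_iterable(tokenized_text))
    print(palavras)  # ['Fujimori', 'venceu', '.', 'Ele', 'viajou', '.']
    ```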