I'm working with NLTK on Portuguese text.
This is my code:
import nltk
import numpy as np
from nltk.corpus import machado, mac_morpho, floresta, genesis
from nltk.text import Text
ptext1 = Text(machado.words('romance/marm05.txt'), name="Memórias Póstumas de Brás Cubas (1881)")
ptext2 = Text(machado.words('romance/marm08.txt'), name="Dom Casmurro (1899)")
ptext3 = Text(genesis.words('portuguese.txt'), name="Gênesis")
ptext4 = Text(mac_morpho.words('mu94se01.txt'), name="Folha de Sao Paulo (1994)")
For example, I want to split ptext4 into sentences and then split those into words:
sentencas = nltk.sent_tokenize(ptext4)
palavras = nltk.word_tokenize(ptext4)
But it doesn't work. The error is: expected string or bytes-like object
I tried this:
sentencas = [row for row in nltk.sent_tokenize(row)]
But the result isn't what I expected:
[In]sentencas
[Out] ['Fujimori']
What can I do, please? I'm new to this.
If you just want the list of words from the machado corpus, use the .words() function.
>>> from nltk.corpus import machado
>>> machado.words()
But if you want to process raw text, for example:
>>> text = machado.raw('romance/marm08.txt')
>>> print(text)
use this idiom:
>>> from nltk import word_tokenize, sent_tokenize
>>> text = machado.raw('romance/marm08.txt')
>>> tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]
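The idiom works because machado.raw() returns a single string, and the tokenizers run regular expressions over a string. A Text object or the result of .words() is already a sequence of tokens, which is the likely reason for the "expected string or bytes-like object" error in the question. A minimal sketch of that behaviour, using only the standard re module as a stand-in for the NLTK tokenizers (the token list and the helper try_regex_tokenize are made up for illustration):

```python
import re

# Assumption: a tiny hand-made token list standing in for corpus.words(...) or a Text object.
tokens = ["Fujimori", "vence", "a", "eleição", "."]


def try_regex_tokenize(data):
    """Run a regex over `data` as a string tokenizer would.

    Returns the list of matches, or the error message if `data` is not a string.
    """
    try:
        return re.findall(r"\w+|[^\w\s]", data)
    except TypeError as err:
        return f"TypeError: {err}"


# A list of tokens is rejected by the regex engine.
from_list = try_regex_tokenize(tokens)

# A plain string is accepted and split into word and punctuation tokens.
from_string = try_regex_tokenize(" ".join(tokens))

print(from_list)
print(from_string)
```

So if the data is already tokenized, there is nothing left for sent_tokenize or word_tokenize to do; they are only needed when you start from raw text.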
And to iterate through tokenized_text, which is a list(list(str)), do this:
>>> for sent in tokenized_text:
...     for word in sent:
...         print(word)
...     break  # stop after the first sentence
...
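If you also need one flat list of words, or a quick sentence count, the nested list can be reshaped with plain Python. This is only a sketch: the tokenized_text below is a small hand-made stand-in with the same list(list(str)) shape, not real output from the corpus.

```python
from itertools import chain

# Assumption: hand-made data with the same shape as tokenized_text above.
tokenized_text = [
    ["Uma", "noite", "destas", "."],
    ["Cumprimentou", "-", "me", "."],
]

# Number of sentences is the length of the outer list.
sentence_count = len(tokenized_text)

# Flatten the sentences into a single list of word tokens.
all_words = list(chain.from_iterable(tokenized_text))

# Join the tokens of the first sentence back into one string.
first_sentence = " ".join(tokenized_text[0])

print(sentence_count)
print(all_words)
print(first_sentence)
```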