
How to tokenize a big text into sentences and words


I'm working with NLTK on Portuguese-language text.

This is my code:

import nltk
import numpy as np
from nltk.corpus import machado, mac_morpho, floresta, genesis

from nltk.text import Text
ptext1 = Text(machado.words('romance/marm05.txt'), name="Memórias Póstumas de Brás Cubas (1881)")
ptext2 = Text(machado.words('romance/marm08.txt'), name="Dom Casmurro (1899)")
ptext3 = Text(genesis.words('portuguese.txt'), name="Gênesis")
ptext4 = Text(mac_morpho.words('mu94se01.txt'), name="Folha de Sao Paulo (1994)")

For example, I want to split ptext4 into sentences and then split those into words:

sentencas = nltk.sent_tokenize(ptext4)
palavras = nltk.word_tokenize(ptext4)

But it doesn't work. The error is: TypeError: expected string or bytes-like object

I tried this:

sentencas = [row for row in nltk.sent_tokenize(row)]

But the result isn't what I expected:

[In]  sentencas
[Out] ['Fujimori']

What can I do? I'm new to this.


Solution

  • If you just want the list of words from the machado corpus, use the .words() function.

    >>> from nltk.corpus import machado
    >>> machado.words()
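
    The error in the question comes from passing an nltk.text.Text object (a sequence of already-tokenized words) to sent_tokenize, which expects a raw string. You get the same TypeError when handing a list to any regex-based string function; a minimal stdlib-only illustration (the token list is made up for the demo):

    ```python
    import re

    # An nltk.text.Text essentially holds a list of token strings like this
    tokens = ["Fujimori", "venceu", "."]

    # Regex-based tokenizers operate on strings; a list of tokens raises
    # the same TypeError reported in the question.
    try:
        re.findall(r"\w+", tokens)
    except TypeError as e:
        print(e)
    ```

    So the fix is to start from the raw string, as shown next.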
    

    But if you want to process raw text, e.g.

    >>> text = machado.raw('romance/marm08.txt')
    >>> print(text)
    

    then use this idiom:

    >>> from nltk import word_tokenize, sent_tokenize
    >>> text = machado.raw('romance/marm08.txt')
    >>> tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]
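
    Since these texts are in Portuguese, you can also pass the language argument so the Punkt model trained on Portuguese is used. A sketch, assuming the punkt data has been downloaded and using a made-up sample sentence:

    ```python
    import nltk
    from nltk import sent_tokenize, word_tokenize

    # Punkt sentence models; newer NLTK releases look up punkt_tab instead
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)

    texto = "O senhor Fujimori venceu a eleição. Depois, ele viajou para Lima."
    sentencas = sent_tokenize(texto, language='portuguese')
    palavras = [word_tokenize(s, language='portuguese') for s in sentencas]
    print(sentencas)
    print(palavras)
    ```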
    

    And to iterate through the tokenized_text, which is a list(list(str)), do this:

    >>> for sent in tokenized_text:
    ...     for word in sent:
    ...         print(word)
    ...     break
    ...
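
    If you also need a single flat list of words again (e.g. to build a frequency distribution), you can flatten the nested list. The sample data below is illustrative, standing in for the real tokenized_text:

    ```python
    from itertools import chain

    # tokenized_text is a list of sentences, each a list of word strings
    tokenized_text = [["Fujimori", "venceu", "."], ["Ele", "viajou", "."]]

    # chain.from_iterable joins the inner lists into one flat sequence
    palavras = list(chain.from_iterable(tokenized_text))
    print(palavras)  # ['Fujimori', 'venceu', '.', 'Ele', 'viajou', '.']
    ```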