I've been working with NLTK for the past three days to get familiar with it, and reading the "Natural Language processing" book to understand what's going on. Could someone clarify the following for me?
Note that the first time you run this command, it is slow because it gathers statistics about word sequences. Each time you run it, you will get different output text. Now try generating random text in the style of an inaugural address or an Internet chat room. Although the text is random, it re-uses common words and phrases from the source text and gives us a sense of its style and content. (What is lacking in this randomly generated text?)
This part of chapter 1 simply says that it "gathers statistics" and that you will get "different output text".
What specifically does generate do and how does it work?
This example of generate() uses text3, which is the Bible's Genesis:
In the beginning , between me and thee and in the garden thou mayest come in unto Noah into the ark , and Mibsam , And said , Is there yet any portion or inheritance for us , and make thee as Ephraim and as the sand of the dukes that came with her ; and they were come . Also he sent forth the dove out of thee , with tabret , and wept upon them greatly ; and she conceived , and called their names , by their names after the end of the womb ? And he
Here, the generate() function seems to simply output phrases created by cutting off text at punctuation and randomly reassembling it, but it has a bit of readability to it.
type(text3) will tell you that text3 is of type nltk.text.Text.
To cite the documentation of Text.generate():
Print random text, generated using a trigram language model.
That means that NLTK has created an N-gram model for the Genesis text, counting each occurrence of sequences of three words so that it can estimate which words are likely to follow any given two words in this text. Because the next word is then picked at random according to those counts, each run produces different output. N-gram models are explained in more detail in chapter 5 of the NLTK book.
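To make the idea concrete, here is a minimal pure-Python sketch of trigram-based generation. It is not NLTK's actual implementation: the function names build_trigram_counts and generate_words are made up for illustration, the short sentence stands in for the Genesis text, and weighted random sampling is assumed as the way the next word is chosen.

```python
import random
from collections import Counter, defaultdict


def build_trigram_counts(tokens):
    """Map each pair of consecutive words to a Counter of the words seen after it."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts


def generate_words(counts, seed_pair, length=20, rng=None):
    """Extend seed_pair one word at a time, sampling each next word
    in proportion to how often it followed the previous two words."""
    if rng is None:
        rng = random.Random()
    w1, w2 = seed_pair
    output = [w1, w2]
    while len(output) < length:
        followers = counts.get((w1, w2))
        if not followers:
            break  # this word pair was never followed by anything in the source
        words = list(followers.keys())
        weights = list(followers.values())
        w3 = rng.choices(words, weights=weights, k=1)[0]
        output.append(w3)
        w1, w2 = w2, w3
    return output


# Toy stand-in for the Genesis tokens (assumption: whitespace-separated tokens).
tokens = ("in the beginning God created the heaven and the earth . "
          "and the earth was without form , and void .").split()
counts = build_trigram_counts(tokens)
print(" ".join(generate_words(counts, ("in", "the"), length=15)))
```

Every three-word window in the output also occurs somewhere in the source, which is why the result looks locally readable even though the text as a whole makes no sense. Building the counts is the "gathers statistics" step, and the random choice is why the output differs between runs.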
See also the answers to this question.