Search code examples
apinlplinguistics

How to build short sentences with a small letter set restriction?


I'm looking for a way to write a program that creates short german sentences with a restricted letter set. The sentences can be nonsense but should grammatically be correct. The following examples only contain the letters "aeilmnost":

  • "Antonia ist mit Tina im Tal."
  • "Tamina malt mit lila Tinte Enten."
  • "Tina nimmt alle Tomaten mit."

For this task I need a dictionary like this one (found in the answer to "Where can I find a parsable list of German words?"). The research area for programatically create text is NLG - Natural Language Generation. On the NLG-Wiki I found a large table of NLG systems. I picked two from the list, which could be appropriate:

Do you have worked with a NLG library and have some advice which one to use for building short sentences with a letter set restriction? Can you recommend a paper to this topic?


Solution

  • Grammatically correct is a pretty fuzzy area, since grammar is not to strictly defined as one might think. What you really want here though, is a part-of-speech tagger, and a markov chain.

    Specifically a markov chain says that given a certain state (the first word for instance) there's just a certain chance of moving on to another state (the next word). They are relatively easy to write from scracth, but I've got a gist here in python that shows how they work if you want an example.

    Once you've got that I would suggest a part-of-speech-based markov chain, combined with just checking to see if words are constructed from your desired character set. In general the algorithm would go something like this:

    1. Pick first word at random, checking that it is constructed solely from your desired set of characters
    2. Use the Markov Chain to predict the next word
    3. Check if that word is an appropriate part of speech, and that it conforms to the desired character set.
    4. If not, predict another word until it is the case.
    5. If so, then repeat starting at 2 to completion.

    Hope that's what you're looking for. Let me know if you have any more questions.