Search code examples
markov

Markov Algorithm for Random Writing


I got a litte problem understanding conceptually the structure of a random writing program (that takes input in form of a text file) and uses the Markov algorithm to create a somewhat sensible output.

So the data structure i am using is to use cases ranging from 0-10. Where at case 0: I count the number a letter/symbol or digit appears and base my new text on this to simulate the input. I have already implemented this by using an Map type that holds each unique letter in the input text and a array of how many there are in the text. So I can simply ask for the size of the array for the specific letter and create output text easy like this.

But now I Need to create case1/2/3 and so on... case 1 also holds what letter is most likely to appear after any letter aswell. Do i need to create 10 seperate arrays for these cases, or are there an easier way?


Solution

  • There are a lot of ways to model this. One approach is as you describe, with an multi-dimensional array where each index is the following character in the chain and the final result is the count.

    # Two character sample:
    int counts[][] = new int[26][26]
    # ... initialize all entries to zero
    
    # 'a' => 0, 'b' => 1, ... 'z' => 25
    # For example for the string 'apple'
    # Note: I'm only writing this like this to show what the result is, it should be in a
    #       loop or function ...
    counts['a'-'a']['p'-'a']++
    counts['p'-'a']['p'-'a']++
    counts['p'-'a']['l'-'a']++
    counts['l'-'a']['l'-'e']++
    

    Then to randomly generate names you would count the number of total outcomes for a given character (ex: 2 outcomes for 'p' in the previous example) and pick a weighted random number for one of the possible outcomes.

    For smaller sizes (say up to 4 characters) that should work fine. For anything larger you may start to run into memory issues since (assuming you're using A-Z) 26^N entries for an N-length chain.

    I wrote something like a couple of years ago. I think I used random pages from Wikipedia to for seed data to generate the weights.