A Brute Force Approach


Below is pseudo-code to generate random letters (and thus random text) using an order-k Markov model and a training text from which the probabilities are calculated.

seed = random k-character substring from the training text

repeat N times to generate N random letters
 for each occurrence of seed in training text
  record the letter that follows the occurrence of seed in a list
  (or if end-of-file follows, record end-of-file)
 choose a random element of the list as the generated letter C
 if C is end-of-file, exit loop
 print or store C
 seed = (last k-1 characters of seed) + C

Here is the Java code implementation of this brute-force approach to generate up to maxLetters letters at random from the training-text in instance variable myString using an order-k Markov model (this is taken directly from MarkovModel.java)

  protected String makeNGram(int k, int maxLetters) {
    // Appending to StringBuilder is faster than appending to String
    StringBuilder build = new StringBuilder();
    // Pick a random starting index
    int start = myRandom.nextInt(myString.length() - k + 1);
    String seed = myString.substring(start, start + k);
    ArrayList list = new ArrayList();
     // generate at most maxLetters characters
    for (int i = 0; i < maxLetters; i++) {
      list.clear();
      int pos = 0;
      while ((pos = myString.indexOf(seed, pos)) != -1 && pos < myString.length()) {
        if (pos + k >= myString.length())
          list.add((char) 0); //This represents the end of the file
        else
          list.add(myString.charAt(pos + k));
        pos++;
      }
      int pick = myRandom.nextInt(list.size());
      char ch = list.get(pick);
      if (ch == 0) //end-of-file
        return build.toString();
      build.append(ch);
      seed = seed.substring(1) + ch;
    }
    return build.toString();
  }

</code>

The code above works, but to generate N letters in a text of size T the code does O(NT) work since it re-scans the text for each generated character. (O(NT) means that the running time is cNT, i.e. within a constant factor proportional to N times T).

###Why maxLetters?

One thing to note about this implementation of Markovian Text Generation is that the length of the generated text can vary. We specify an upper bound for the number of characters via the argument maxLetters, but if the end-of-file case is reached before we reach the upper bound, we will generate less than maxLetters characters.

This is an implementation choice - there are several ways we could guarantee that a fixed number of characters are generated:

  • We could, for example, choose to use a "wraparound" to treat the start of the source as following the end of the source. However, this can result in us generating k-grams that don't appear in the original text, which contradicts the idea that the text we generate should resemble the source text.
  • We could try to only pick kgrams which have the potential to generate the desired length of text. However, this would involve using graph theory and turning this simple algorithm into a more complicated one.
  • We could simply discard attempts at generating text until we make an attempt which generates the desired number of characters. However, as long as there is some possibility of not generating the desired number of characters, this method has the potential to go on forever (as we could end up failing over and over again) Thus, while we could ensure a fixed number of characters, it is much simpler and makes more sense within the scope of Markovian Text Generation not to.