Below is pseudo-code to generate random letters (and thus random text) using an order-k Markov model and a training text from which the probabilities are calculated.
seed = random k-character substring from the training text
repeat N times to generate N random letters
for each occurrence of seed in training text
record the letter that follows the occurrence of seed in a list
(or if end-of-file follows, record end-of-file)
choose a random element of the list as the generated letter C
if C is end-of-file, exit loop
print or store C
seed = (last k-1 characters of seed) + C
Here is the Java code implementation of this brute-force approach to generate up to maxLetters letters at random from the training-text in instance variable myString using an order-k Markov model (this is taken directly from MarkovModel.java)
protected String makeNGram(int k, int maxLetters) {
// Appending to StringBuilder is faster than appending to String
StringBuilder build = new StringBuilder();
// Pick a random starting index
int start = myRandom.nextInt(myString.length() - k + 1);
String seed = myString.substring(start, start + k);
ArrayList
// generate at most maxLetters characters
for (int i = 0; i < maxLetters; i++) {
list.clear();
int pos = 0;
while ((pos = myString.indexOf(seed, pos)) != -1 && pos < myString.length()) {
if (pos + k >= myString.length())
list.add((char) 0); //This represents the end of the file
else
list.add(myString.charAt(pos + k));
pos++;
}
int pick = myRandom.nextInt(list.size());
char ch = list.get(pick);
if (ch == 0) //end-of-file
return build.toString();
build.append(ch);
seed = seed.substring(1) + ch;
}
return build.toString();
}
</code>
The code above works, but to generate N letters in a text of size T the code does O(NT) work since it re-scans the text for each generated character. (O(NT) means that the running time is cNT, i.e. within a constant factor proportional to N times T).
###Why maxLetters?
One thing to note about this implementation of Markovian Text Generation is that the length of the generated text can vary. We specify an upper bound for the number of characters via the argument maxLetters
, but if the end-of-file case is reached before we reach the upper bound, we will generate less than maxLetters characters.
This is an implementation choice - there are several ways we could guarantee that a fixed number of characters are generated: