Reference: Papadopoulos, A., Roy, P., Pachet, F. Avoiding Plagiarism in Markov Sequence Generation, Proc. of AAAI 2014, Quebec.
The problem: generate non-plagiarising Markov sequences
Many systems use Markov models to automatically generate finite-length sequences that “sound like” or “look like” the content of a training corpus: books from the same authors, songs by a composer, etc. In this context, the order of the model, i.e. the length of contexts used to generate continuations, is traditionally used to control the degree of style imitation: a higher order creates better imitation. However, increasing the Markov order tends to replicate verbatim large “chunks” from the original corpus, producing a basic form of plagiarism. We provide a technique that limits the length of those replicated chunks, thus solving for the first time an old and well-known problem.
We illustrate this phenomenon by generating text using Johnston’s English translation of Pushkin’s Eugene Onegin, an allusion to Markov’s pioneering studies. The first example, generated with a Markov order 1, replicates chunks of size up to 7: “that rogue, backbiter, pantaloon, bribe-taker, glutton and” is taken verbatim from the corpus. The second example, generated with a Markov order 3, replicates chunks of size up to 20 !
The MaxOrder constraint can be used to generate Markov sequences with a guaranteed bound on the size of the largest chunk: the max order. Our technique uses automaton-representations of the space of valid sequences. This automaton representation is built using a cascade of string matching and path finding algorithms (see paper). Once built, the automaton produces exactly those sequences from the initial Markov model that are of the given max order. One key issue here is not merely to produce this automaton, but to produce it in polynomial time. This is important for building interactive content generation application.
Visualising the automata for different max orders gives a hint of the effect of MaxOrder. We show the automaton representations of all sequences in the style of Eugene Onegin, with a Markov order 2, and an increasingly stronger max order: no max order, max order 10 and max order 7.
Fig. 4: The automaton for a Markov order 2, and no max order: ngrams are homoegenously connected.
Fig. 5: The automaton for a Markov order 2, and max order 10: dark connected islands emerge that are loosely interconnected
Fig. 6: The automaton for a Markov order 2, and max order 7: even fewer connected islands are present.
We can further visualise the effect of MaxOrder, by generating sequences for many Markov orders, and showing the size of the chunks composing those sequences. Markov order 1 sequences contain chunks of size mostly 3 or less, and are mostly junk. Increasing the order creates larger chunks, and in the extreme, chunks of size close to 30, with a Markov order 4. With MaxOrder, we can reach a sweet spot in which sequences are both stylistically consistent (no junk sequences) and novel (no plagiarism).
The following is an automatically generated sequence, in the style of Eugene Onegin, with a Markov order 2, and an imposed max order 5. A prototype of the sequence generator is available online: http://apijava.flow-machines.com:8080/textgen/
We also integrated these techniques in a music lead sheet generator. In this application, the corpus consists of all the lead sheets of a composer. By imposing a max order, we can ensure plagiarism in the generated lead sheets does not occur.