Solving the max order problem

Reference: Papadopoulos, A., Roy, P., Pachet, F. Avoiding Plagiarism in Markov Sequence Generation, Proc. of AAAI 2014, Quebec.

The problem: generate non-plagiarising Markov sequences

Many systems use Markov models to automatically generate finite-length sequences that “sound like” or “look like” the content of a training corpus: books from the same authors, songs by a composer, etc. In this context, the order of the model, i.e. the length of contexts used to generate continuations, is traditionally used to control the degree of style imitation: a higher order creates better imitation. However, increasing the Markov order tends to replicate verbatim large “chunks” from the original corpus, producing a basic form of plagiarism. We provide a technique that limits the length of those replicated chunks, thus solving for the first time an old and well-known problem.

We illustrate this phenomenon by generating text using Johnston’s English translation of Pushkin’s Eugene Onegin, an allusion to Markov’s pioneering studies. The first example, generated with a Markov order 1, replicates chunks of size up to 7: “that rogue, backbiter, pantaloon, bribe-taker, glutton and” is taken verbatim from the corpus. The second example, generated with a Markov order 3, replicates chunks of size up to 20 !

Fig. 1: A Markov order 1 sequence, looking mostly like nonsensical junk.
Fig. 1: A Markov order 1 sequence, looking mostly like nonsensical junk.
Fig. 2: Increasing the order (here, to 3) produces arguably better sequences, because they make more sense, but also introduces plagiarism.
Fig. 2: Increasing the order (here, to 3) produces arguably better sequences, because they make more sense, but also introduces plagiarism.

The Solution

The MaxOrder constraint can be used to generate Markov sequences with a guaranteed bound on the size of the largest chunk: the max order. Our technique uses automaton-representations of the space of valid  sequences. This automaton representation is built using a cascade of string matching and path finding algorithms (see paper). Once built, the automaton produces exactly those sequences from the initial Markov model that are of the given max order. One key issue here is not merely to produce this automaton, but to produce it in polynomial time. This is important for building interactive content generation application.

Fig. 3: From a corpus to plagiarism-free style-imitating sequences: the different steps of our solution.
Fig. 3: From a corpus to plagiarism-free style-imitating sequences: the different steps of our solution.

 

Visualising the automata for different max orders gives a hint of the effect of MaxOrder. We show the automaton representations of all sequences in the style of Eugene Onegin, with a Markov order 2, and an increasingly stronger max order: no max order, max order 10 and max order 7.


Fig. 4: The automaton for a Markov order 2, and no max order: ngrams are homoegenously connected.


Fig. 5: The automaton for a Markov order 2, and max order 10: dark connected islands emerge that are loosely interconnected


Fig. 6: The automaton for a Markov order 2, and max order 7: even fewer connected islands are present.

We can further visualise the effect of MaxOrder, by generating sequences for many Markov orders, and showing the size of the chunks composing those sequences. Markov order 1 sequences contain chunks of size mostly 3 or less, and are mostly junk. Increasing the order creates larger chunks, and in the extreme, chunks of size close to 30, with a Markov order 4. With MaxOrder, we can reach a sweet spot in which sequences are both stylistically consistent (no junk sequences) and novel (no plagiarism).

Fig. 7: Illustration of the effect of MaxOrder: combining style imitation and novelty.
Fig. 7: Illustration of the effect of MaxOrder: combining style imitation and novelty.

Examples

The following is an automatically generated sequence, in the style of Eugene Onegin, with a Markov order 2, and an imposed max order 5. A prototype of the sequence generator is available online: http://apijava.flow-machines.com:8080/textgen/

blessedishewho

 

We also integrated these techniques in a music lead sheet generator. In this application, the corpus consists of all the lead sheets of a composer. By imposing a max order, we can ensure plagiarism in the generated lead sheets does not occur.

leadsheet1

Fig. 8: Lead sheets generated with a Markov order 2 and 3
Fig. 8: Lead sheets generated with a Markov order 2 and 3
Fig. 9: Plagiarism is limited to a maximum order 6 in this lead sheet.
Fig. 9: Plagiarism is limited to a maximum order 6 in this lead sheet.
Facebooktwitteryoutube