The index of coincidence

2008-06-06

in Geek stuff, Security

Purple irises

If you are dealing with a long polyalphabetically enciphered message with a short key, the Kasiski examination is an effective mechanism of cryptanalysis. Using repeated sections in the ciphertext, and the assumption that these are often places where the same piece of plaintext was enciphered with the same portion of the key, you can work out the length of the keyword. Then, it is just a matter of dividing the message into X collections of letters (X corresponding to the length of the keyword) and performing a frequency analysis of each. That way, you can identify the cipher alphabet used in each of the encipherments, as well as the keyletter.

If the key is long, however, it may be impossible to get enough letters per alphabet to perform a frequency analysis. Similarly, there may not be enough repetitions in the key to create the pairings Kasiski requires. Here, the clever technique of the index of coincidence may be the answer.

Consider two scenarios, one in which you have two strings of random letters and one in which you have two strings of English:

GKECOAENCYBGDWQMGGRR
VQNWSKXMJWTBKCCMRJUO

TOSTRIVETOSEEKTOFIND
SOWEBEATONBOATSAGAIN

At issue is the number of times letters will match between the top and bottom row. When the strings are random, the chance is always 1/26 or 0.0385. Because some letters in English are more common and some are less common, a match is more likely when using English text. Imagine, for instance, that 75% of the letters in a normal English sentence were ‘E.’ Any two pieces of English text would get a lot of ‘E’ matches. Even if enciphered so that ‘E’ is represented by something else, the number of matches would remain higher than a random sample.

Since polyalphabetic ciphers involve enciphering each letter in a plaintext using a different ciphertext alphabet, an ‘E’ in one part of a ciphertext need not represent an ‘E’ somewhere else. That being said, as long as you line up two ciphertext messages so the letter on top and the letter underneath are using the same alphabet, you will get the same pattern of better-than-random matches for Englist text. Imagine, to begin with, a message enciphered using five different alphabets (1,2,3,4 and 5). Two messages using the same alphabets and key (say, 54321) could be ligned up either in a matching way or in an offset way:

543215432154321
543215432154321

543215432154321
321543215432154

Note that these strings describe the alphabet being used to encipher each plaintext letter, not the letter itself. In the second case, the probability of a match should be essentially random (one property of polyalphabetic ciphers is that they flatten out the distribution of letters from the underlying plaintext). In the second case, we would get the same matching probability as with unenciphered English (0.0667). We can thus take any two messages enciphered with the same key and try shifting them against each other, one letter at a time. When the proportion of matches jumps from about 0.0385 to about 0.0667, we can conclude that the two have been properly matched up. This is true regardless of the length of the key, and can be used with messages that are not of the same length.

This doesn’t actually solve the messages for us, but it goes a long way toward that end. The more messages we can collect and properly align, the more plausible it becomes to crack the entire collection and recover the key. This method was devised by William F. Friedman, possibly America’s greatest cryptographer, and is notable because anybody sufficiently clever could have invented it back when polyalphabetics were first used (16th century or earlier). With computers to do the shifting and statistics for us, the application of the index of coincidence is a powerful method for use against polyalphabetic substitution ciphers, including one time pads where the operators have carelessly recycled sections of the key.

Report a typo or inaccuracy

{ 3 comments… read them below or add one }

Litty June 6, 2008 at 11:13 am

Those are beautiful flowers. Nicely photographed.

robert April 1, 2010 at 3:13 am

In your example NONE of the letters match between the top and
bottom row, so what in the hell are you talking about then??

Milan April 1, 2010 at 8:08 am

I was just illustrating the arrangement of the strings of text. To actually get good statistics, you naturally want longer strings.

Leave a Comment

You can use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

{ 1 trackback }

Previous post:

Next post: