To decipher a secret message through cryptanalysis, you need enough data to judge whether the decipherment is correct. If all a cryptanalyst has to work with is enciphered text (say, in the form of an intercepted message), the attempt to decipher it is called a ciphertext-only attack. For a variety of reasons, such attacks are very difficult to carry out. The consideration described below is one of the most basic.
To understand why a message of sufficient length is important, consider a message that consists only of a single enciphered phone number: “724-826-5363.” These digits could have been modified in any of a great number of ways: for instance, by adding or subtracting a certain amount from each digit (or by alternating between adding and subtracting). Without knowing more, or being willing to test lots of candidate phone numbers, we have no way of telling whether we have deciphered the message properly. On the basis of the ciphertext alone, 835-937-6474 is just as plausible as 502-604-3141.
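The point can be checked with a minimal sketch. It assumes the simplest scheme mentioned above, the same amount added to every digit, and it further assumes that digits wrap around modulo 10 (the text does not say how overflow is handled). The function name `shift_digits` is purely illustrative.

```python
def shift_digits(number: str, shift: int) -> str:
    """Add `shift` to every digit modulo 10, leaving separators untouched."""
    return "".join(
        str((int(ch) + shift) % 10) if ch.isdigit() else ch
        for ch in number
    )

ciphertext = "724-826-5363"

# Try every possible key: each one yields a well-formed phone number,
# so nothing in the output tells us which shift was actually used.
candidates = {shift: shift_digits(ciphertext, shift) for shift in range(10)}
for shift, candidate in candidates.items():
    print(f"shift {shift}: {candidate}")
```

A shift of 1 gives 835-937-6474 and a shift of 8 (equivalent to subtracting 2) gives 502-604-3141, the two candidates quoted above. All ten outputs look equally like phone numbers.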
This is only a significant problem for short messages. One could imagine ways in which BHJG could mean ‘HIDE’ or ‘TREE’ or ‘TRAP.’ Using different keys with the same algorithm could generate any four-letter word from that ciphertext. Once we have a long enough enciphered message, however, it becomes much more obvious when we have deciphered it properly. If I know that the ciphertext:
has been produced using the Vigenère cipher, and I find that it deciphers to:
when I use the keyword MUSIC, it is highly likely that I have found both the key and the unenciphered text.
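To make the four-letter example concrete, here is a small sketch that assumes BHJG was produced by a Vigenère cipher with a four-letter key (the text does not say which algorithm was used for that example, so this is an assumption). The helper names `vigenere_decrypt` and `key_for` are illustrative.

```python
from string import ascii_uppercase as ALPHABET

def vigenere_decrypt(ciphertext: str, key: str) -> str:
    """Subtract the repeating key from the ciphertext, letter by letter (A=0 ... Z=25)."""
    return "".join(
        ALPHABET[(ALPHABET.index(c) - ALPHABET.index(key[i % len(key)])) % 26]
        for i, c in enumerate(ciphertext)
    )

def key_for(ciphertext: str, plaintext: str) -> str:
    """Return the key that would turn `plaintext` into `ciphertext`."""
    return "".join(
        ALPHABET[(ALPHABET.index(c) - ALPHABET.index(p)) % 26]
        for c, p in zip(ciphertext, plaintext)
    )

ciphertext = "BHJG"
for guess in ("HIDE", "TREE", "TRAP"):
    key = key_for(ciphertext, guess)
    # Every guess has a key that "works", so none can be ruled out.
    print(f"key {key} -> {vigenere_decrypt(ciphertext, key)}")
```

Under this assumption the keys UZGC, IQFC and IQJR yield HIDE, TREE and TRAP respectively. When the key is as long as the message, some key exists for every candidate plaintext. A short keyword applied to a long message is different: very few keys, and usually only one, will produce readable text throughout.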
This concept is formalized in the idea of unicity distance, introduced by Claude Shannon in the 1940s. Unicity distance describes the amount of ciphertext that we must have in order to be confident that we have found the right plaintext. It is a function of two things: the entropy of the plaintext message (something written in proper English is far less random than a phone number) and the length of the key being used for encryption.
To calculate the unicity distance for a message written in English, divide the length of the key in bits (say, 128 bits) by 6.8 (a measure of the redundancy in English, in bits per character). With about nineteen characters of ciphertext (128 / 6.8 ≈ 18.8), we can be confident that we have found the correct message and not simply one of a number of possibilities, as in the phone number example. By definition, compressed files have redundancy removed; as such, you may want to divide the key length by about 2.5 to get their unicity distance. For truly random data, the level of redundancy is zero, so the unicity distance is infinite. If I encipher a random number and send it to you, a person who intercepts it will never be able to determine, on the basis of the ciphertext alone, whether they have deciphered it properly.
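The arithmetic is simple enough to sketch in a few lines. The function name `unicity_distance` and its parameter names are illustrative; the redundancy figures of 6.8 and 2.5 bits per character are the ones quoted above, not independently derived values.

```python
import math

def unicity_distance(key_bits: float, redundancy_bits_per_char: float) -> float:
    """Key length in bits divided by the plaintext redundancy per character."""
    if redundancy_bits_per_char == 0:
        # No redundancy: no amount of ciphertext is ever enough.
        return math.inf
    return key_bits / redundancy_bits_per_char

print(round(unicity_distance(128, 6.8), 1))  # English text: 18.8
print(round(unicity_distance(128, 2.5), 1))  # compressed data: 51.2
print(unicity_distance(128, 0))              # random data: inf
```

For a 128-bit key this gives roughly 18.8 characters for English text and roughly 51 characters for compressed data, and it never terminates for random data.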
For many types of data files, the unicity distance is comparable to that of normal English text. This holds for word processor files, spreadsheets, and many databases. In fact, many types of computer files have significantly smaller unicity distances because they have standardized beginnings. If I know that a file sent each morning begins with “The following is the weather report for…”, I can determine very quickly whether I have deciphered it correctly.
The last example is particularly noteworthy. When cryptanalysts are presented with ciphertext that was produced by a known cipher (say, Enigma) and that is known to include a particular string of text (such as the weather report introduction), it can become enormously easier to determine the encryption key being used. These bits of probable text are called ‘cribs’ and they played an important role in Allied codebreaking efforts during the Second World War. The use of the German word ‘wetter’ at the same point in messages sent at the same time each day was quite useful for determining that day’s key.
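A crib attack on Enigma is far too involved to reproduce here, so the sketch below substitutes the much simpler Vigenère cipher to show the principle: when the opening words of a message are known, the key falls out almost immediately. The message text, the keyword MUSIC, and the helper names `vigenere_encrypt` and `recover_key` are all hypothetical and chosen only for illustration.

```python
from string import ascii_uppercase as ALPHABET

def vigenere_encrypt(plaintext: str, key: str) -> str:
    """Add the repeating key to the plaintext, letter by letter (A=0 ... Z=25)."""
    return "".join(
        ALPHABET[(ALPHABET.index(p) + ALPHABET.index(key[i % len(key)])) % 26]
        for i, p in enumerate(plaintext)
    )

def recover_key(ciphertext: str, crib: str) -> str:
    """Recover the repeating key, assuming `crib` is the start of the plaintext."""
    # Subtracting the crib from the ciphertext exposes the key stream.
    stream = "".join(
        ALPHABET[(ALPHABET.index(c) - ALPHABET.index(p)) % 26]
        for c, p in zip(ciphertext, crib)
    )
    # The key is the shortest prefix that, repeated, reproduces the stream.
    for length in range(1, len(stream) + 1):
        candidate = stream[:length]
        if all(stream[i] == candidate[i % length] for i in range(len(stream))):
            return candidate
    return stream

# Hypothetical intercepted message, enciphered here with the keyword MUSIC.
message = "THEFOLLOWINGISTHEWEATHERREPORTFORTUESDAY"
ciphertext = vigenere_encrypt(message, "MUSIC")

# The attacker knows only the ciphertext and the standard opening phrase.
crib = "THEFOLLOWINGISTHEWEATHERREPORT"
print(recover_key(ciphertext, crib))  # MUSIC
```

With the key in hand, the rest of the message can be deciphered directly. Real cribs were harder to exploit because the cipher was more complex and the position of the crib was not always certain, but the underlying idea is the same: known plaintext sharply narrows the set of possible keys.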