Mnemonic encoder
The mnemonic encoding presented here is a method for converting binary data into a sequence of words suitable for transmission or storage by voice, handwriting, memorization or other non-computerized means.
The encoding converts 32 bits of data into 3 words from a vocabulary of 1626 words. The words were chosen to be easy to understand over the phone and, as far as possible, internationally recognizable. More information about the word list is available here.
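The three-word figure follows from simple arithmetic: 1626^3 = 4,298,942,376, which is just above 2^32 = 4,294,967,296, so three word indices can carry any 32-bit value. The sketch below illustrates this as a plain base-1626 split. The function names and the digit order (least significant first) are assumptions made for illustration; the reference implementation may order the digits differently, and the indices would still have to be mapped onto the actual word list.

```python
from typing import List

BASE = 1626          # vocabulary size stated above
WORDS_PER_GROUP = 3  # three words carry one 32-bit group

# Three base-1626 digits are enough to represent every 32-bit value.
assert BASE ** WORDS_PER_GROUP >= 2 ** 32


def encode_group(value: int) -> List[int]:
    """Split a 32-bit integer into three word indices.

    Assumption: least significant digit first.
    """
    if not 0 <= value < 2 ** 32:
        raise ValueError("value must fit in 32 bits")
    indices = []
    for _ in range(WORDS_PER_GROUP):
        value, digit = divmod(value, BASE)
        indices.append(digit)
    return indices


def decode_group(indices: List[int]) -> int:
    """Recombine three word indices into the original 32-bit integer."""
    if len(indices) != WORDS_PER_GROUP:
        raise ValueError("expected exactly three indices")
    if any(not 0 <= i < BASE for i in indices):
        raise ValueError("index outside the vocabulary")
    value = 0
    for digit in reversed(indices):
        value = value * BASE + digit
    if value >= 2 ** 32:
        # A few index triples lie above 2**32 and encode nothing.
        raise ValueError("indices do not encode a 32-bit value")
    return value
```

Because 1626^3 is slightly larger than 2^32, a small number of index triples do not correspond to any 32-bit value; in this sketch the decoder rejects them.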
Applications
The mnemonic encoding makes it practical to use numbers that would otherwise be too large to handle manually. Below are 64 bits from my /dev/urandom device, in hexadecimal and in mnemonic encoding.
hexadecimal: 8f9240688685a1e9
mnemonic: magic-slang-crimson--inch-calypso-ibiza
Which one is easier to dictate over the phone? Type into your computer from a piece of paper? Which one is easier to memorize?
The mnemonic encoding makes it practical to manually handle cryptographically significant amounts of random data.
I compiled this word list primarily for encoding the hash of a public key. I shall refer to the hash of a public key, encoded in the mnemonic encoding, as a self-certifying identity. A six-word string is about as easy to remember as a phone number and may be used as an identity reference. The association between this identity string and a public key can be verified mathematically without referring to a trusted third party. In certain cases this may make the use of a certificate authority unnecessary.
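As a rough illustration of how such an identity could be derived and checked, the sketch below hashes a public key, keeps 64 bits, and turns them into six word indices. The choice of SHA-256, the truncation to the first 8 bytes, the big-endian byte order, the digit order and the function names are all assumptions for this example; none of them is specified above, and the indices would still need to be mapped onto the real word list.

```python
import hashlib

BASE = 1626  # vocabulary size of the mnemonic encoding


def identity_indices(public_key: bytes) -> list:
    """Derive six word indices (64 bits) from a public key.

    Assumptions: SHA-256 as the hash, first 8 bytes kept,
    each 4-byte group read big-endian, least significant digit first.
    """
    digest = hashlib.sha256(public_key).digest()
    truncated = digest[:8]  # 64 bits, i.e. two 32-bit groups
    indices = []
    for offset in (0, 4):
        value = int.from_bytes(truncated[offset:offset + 4], "big")
        for _ in range(3):
            value, digit = divmod(value, BASE)
            indices.append(digit)
    return indices


def verify_identity(public_key: bytes, claimed_indices) -> bool:
    """Check a claimed identity by recomputing it from the key itself."""
    return identity_indices(public_key) == list(claimed_indices)
```

Verification needs only the key and the six words: the receiver recomputes the indices from the key presented and compares them with the identity string they already know.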
This encoding may be used in any application where large numbers need to be handled manually: 128-bit IPv6 addresses, hashes of files, etc.
Similar systems
The concept is similar to the encoding used in the one-time password (OTP) scheme and the biometric word list used in PGPfone and PGP 6.5.
The major difference is in the optimization criteria used in creating the wordlist. The OTP wordlist is designed for easy typing and is optimized for minimum length – the longest word is 4 letters. As a result the OTP wordlist contains very similar words such as “AD” and “ADD”. This would make it a very poor choice if you need to dictate the word sequence over the phone, for example.
The PGPfone wordlist by Philip Zimmermann and Patrick Juola is optimized for maximum phonetic distance between words. See their report for an in-depth discussion of the techniques used. While I admire their effort, I find the result somewhat disappointing. I believe it suffers from the following problems:
- Low efficiency – only 8 bits are encoded per word and the average word length is 7.6 characters.
- Word quality – contains words that may not be known to non-native speakers of English or are simply awkward to use or type: stethoscope, adroitness, undaunted, preshrunk, sociable.
Maximizing the vector distance between valid symbols is a standard technique in error correction codes. When combined with a Maximum Likelihood Sequence (MLS) receiver it ensures the lowest error rate against random Gaussian noise. Unfortunately, a human listener is not an MLS receiver. Mistaking one word on the list for another word on the list is not the only possible error. A word on the list can also be mistaken for one that is not on the list, since the receiver cannot memorize the entire word list and compare every received word against each entry. Maximizing the phonetic distance between words is a valid goal, but it does not help against this problem. I believe that using simple and well-known words is at least as important for reducing the overall error rate.
For comparison, here is the same amount of data (64 bits) encoded in PGPfone, OTP and mnemonic encoding:
PGPfone: egghead combustion quota molecule spaniel molecule fracture Waterloo
OTP: AS JOEY NOSE MOLL ED MOVE
mnemonic: daniel-ivory-percent--shake-olivia-subway
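The word counts in this comparison can be checked with a little arithmetic. The sketch below computes how many words each scheme needs for 64 bits and compares that with the example strings above. The 8 bits per word for PGPfone and 32 bits per 3 words for the mnemonic encoding come from the text; the 11 bits per word for OTP comes from the 2048-word dictionary of RFC 2289. The helper names are made up for this example.

```python
import re

# (bits per group, words per group) for each scheme.
SCHEMES = {
    "PGPfone": (8, 1),    # 8 bits per word, as stated above
    "OTP": (11, 1),       # RFC 2289: 2048-word dictionary, 11 bits per word
    "mnemonic": (32, 3),  # 32 bits per 3 words
}

# The 64-bit examples quoted above.
EXAMPLES = {
    "PGPfone": "egghead combustion quota molecule spaniel molecule fracture Waterloo",
    "OTP": "AS JOEY NOSE MOLL ED MOVE",
    "mnemonic": "daniel-ivory-percent--shake-olivia-subway",
}


def words_needed(bits: int, scheme: str) -> int:
    """Number of words a scheme needs to carry the given number of bits."""
    group_bits, group_words = SCHEMES[scheme]
    # Integer ceiling division avoids floating-point rounding.
    return -(-bits * group_words // group_bits)


def count_words(encoded: str) -> int:
    """Count words separated by whitespace or hyphens."""
    return len([w for w in re.split(r"[\s-]+", encoded) if w])


for name in SCHEMES:
    print(name, words_needed(64, name), count_words(EXAMPLES[name]))
```

For 64 bits this gives 8 words for PGPfone and 6 each for OTP and the mnemonic encoding, matching the examples.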
I believe the wordlist used for this encoding strikes a good compromise between the wordlist length, word length, phonetic distance between words and vocabulary knowledge expected from users.
Download
Version 0.73 is available for download: HTTP
The tar archive contains a reference implementation of the encoder/decoder in ANSI C and two small sample programs, mnencode and mndecode.
To Do
- Improve the wordlist.
- Finish the soundalike/misspelling matcher in the decoder.
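To give an idea of what a misspelling matcher might look like, here is a minimal sketch that maps a mistyped word to the closest vocabulary entry. It is not the matcher planned for the reference decoder: it uses the generic string similarity from Python's difflib as a stand-in for a real soundalike measure, the vocabulary is only the handful of example words quoted on this page, and the function name and the cutoff of 0.75 are arbitrary choices for illustration.

```python
import difflib

# Tiny stand-in vocabulary: only the example words quoted on this page,
# not the real 1626-word list.
SAMPLE_WORDS = [
    "magic", "slang", "crimson", "inch", "calypso", "ibiza",
    "daniel", "ivory", "percent", "shake", "olivia", "subway",
]


def closest_word(heard, vocabulary=SAMPLE_WORDS, cutoff=0.75):
    """Return the vocabulary word most similar to the input, or None.

    Exact matches are returned directly; otherwise the best difflib
    match at or above the cutoff is used.
    """
    candidate = heard.strip().lower()
    if candidate in vocabulary:
        return candidate
    matches = difflib.get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this sample vocabulary, `closest_word("crimsen")` returns `"crimson"`, while an unrelated input such as `"xyzzy"` returns `None` so the decoder can ask for the word again instead of guessing.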
Acknowledgements
Alfred V. Aho, Peter J. Weinberger and Brian W. Kernighan for awk.
I couldn’t have done this without it.
Neil Haller et al. / One Time Password – RFC 2289
Contributions and comments from Ori Pomerantz, Shear Dar, Yoav Weiss, Julian Noble and others.
Maintainers of the various wordlists and resources I have used in the compilation of the wordlist.
RMS, Linus and the open source community for you-know-what.