Model Your Users: Algorithms Behind the Minuum Keyboard

When you’re creating a new keyboard technology, a ton of work goes into both the interaction design and the algorithms behind the scenes. While the design of our keyboard is best understood simply by using it, the real “magic” that makes our one-dimensional keyboard possible lies in the statistical algorithms that make it tick.

If you haven’t already seen or used the Minuum keyboard, the brief summary is that we let you compress the conventional keyboard down to just one row of keys, opening up the possibility of typing anywhere you can measure one dimension of input.

 
[Figure: the Minuum keyboard]

By shrinking the keyboard in this way we soon had to grapple with a basic fact: human input is imprecise, and the faster you type the more imprecise it gets. Rather than trying to improve user precision, we instead embrace sloppy typing.

This only works because we use disambiguation in addition to auto-correction. While “auto-correction” implies that you made a mistake that needed correcting, “disambiguation” accepts the fundamental ambiguity of human interaction, and uses an understanding of language to narrow things down. Think of it like speech recognition: in a noisy bar, the problem isn’t that your friends are speaking incorrectly; human speech is ambiguous, and the noisiness of the environment sure doesn’t help. You can only understand them because you have prior knowledge of the sorts of things they are likely to say.

Which leads us into the wonderful world of…

Bayesian statistics!

Minuum combines two factors to evaluate a word: a spatial model, which understands how precise you are when you tap on the keyboard (we perform user studies to measure this), and a language model, which understands what words you’re likely to use (we build this from huge bodies of real-world text). If you tap on the keyboard five times, and those taps somewhat resemble the word “hello”, we use the following Bayesian equation to test how likely it is that you wanted the word “hello”:

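$$p(\text{hello} \mid \text{taps}) \;\propto\; \underbrace{p(\text{taps} \mid \text{hello})}_{\text{spatial term}} \times \underbrace{p(\text{hello})}_{\text{language term}}$$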

Let’s break that equation down: the probability that you wanted the word “hello” given those taps is proportional to the product of the spatial and language terms. The spatial term gives the likelihood that wanting to type the word “hello” could have led you to input that sequence of taps; the language term gives the probability that you would ever type the word “hello”.

Minuum’s job is to find the word that maximizes p(word|taps). In the example above, Minuum is generating a score for the word “hello”. To find the best word, Minuum would compare this score to the scores for other words, calculated the same way. The closer your taps are to the correct locations for a given word, the greater the spatial term for that word; the more common a word is in English (or French, German, Italian or Spanish if you have one of those languages enabled), the greater the language term.
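To make the comparison concrete, here is a minimal sketch of that scoring step. The candidate words and every probability in it are made-up numbers for illustration, not values from Minuum, and the function name `score_words` is hypothetical.

```python
def score_words(spatial, language):
    """Score each candidate word as p(taps | word) * p(word)."""
    return {word: spatial[word] * language[word] for word in spatial}


# Hypothetical values, for illustration only.
spatial = {"hello": 0.020, "jello": 0.030, "hells": 0.010}   # p(taps | word)
language = {"hello": 1e-3, "jello": 1e-5, "hells": 2e-5}     # p(word)

scores = score_words(spatial, language)

# The recommended word is the one with the highest score.
best = max(scores, key=scores.get)

# Dividing by the total turns the scores into p(word | taps)
# over this small candidate set.
total = sum(scores.values())
posterior = {word: s / total for word, s in scores.items()}

print(best)
print(posterior)
```

In this toy example “jello” has the best spatial fit, but “hello” still wins because its language term is so much larger.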

A simple spatial model

Minuum uses a fairly complicated spatial model (remember the spatial model represents how people tend to actually type on the keyboard). This model can handle many kinds of imprecision, such as extra and omitted characters. A simple model that works surprisingly well, however, is to treat the probability density of a tap as a Gaussian centered at the target character.

Under this model, if you mean to type a “t”, the most likely point you tap on the keyboard is right on the “t”, but there is still a significant probability that you tap a nearby location, closer to the “v” or the “g”.
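A rough sketch of the simple Gaussian model might look like the following. The centre position for “t” and the width `sigma` are assumed values chosen for illustration; the real numbers come from user studies and are not given here.

```python
import math


def tap_density(tap, target, sigma):
    """Gaussian probability density of tapping at `tap` when aiming at `target`."""
    z = (tap - target) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


T_CENTER = 0.45   # assumed position of "t" on a 0-to-1 line (hypothetical)
SIGMA = 0.05      # assumed spread of taps around the target (hypothetical)

on_target = tap_density(0.45, T_CENTER, SIGMA)   # tap lands right on "t"
off_target = tap_density(0.50, T_CENTER, SIGMA)  # tap drifts toward a neighbour

print(on_target, off_target)
```

The density peaks at the target and falls off smoothly, so a tap one `sigma` away is less likely than a perfect tap but far from impossible.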

A simple language model

The simplest language model is just a count of word frequencies. Take a large body of text (a corpus), and count how many times each word shows up.

Word    Frequency
if      1,115,786
IV      5,335

To compare two potential words, say “if” and “IV”, we compare their counts: according to the table above, “if” is around 200 times more likely to be typed than “IV”.
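As a quick check of that arithmetic, here is a small sketch using the two counts from the table. The helper name `relative_likelihood` is made up for this example. Note that the total corpus size cancels out when comparing two words, so it is not needed.

```python
# Word counts taken from the table above.
counts = {"if": 1_115_786, "IV": 5_335}


def relative_likelihood(word_a, word_b, counts):
    """How many times more frequent word_a is than word_b in the corpus."""
    return counts[word_a] / counts[word_b]


ratio = relative_likelihood("if", "IV", counts)
print(round(ratio))
```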

This simple model, like the simple spatial model, works quite well in practice. Further improvements can come from using context such as the word immediately before the current entry.

Word(s)    Frequency
what if    13,207
what of    1,380

The phrase “what if” is about ten times more common than “what of”, so even though “if” and “of” are both very common words, given the context “what”, we can confidently guess that “if” is the intended word.
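The same idea can be sketched in code with the two phrase counts above. This is only an illustration of using the previous word as context; the function name `best_next_word` is hypothetical, and a real language model would cover far more phrases.

```python
# Phrase counts taken from the table above: (previous word, word) -> count.
bigram_counts = {
    ("what", "if"): 13_207,
    ("what", "of"): 1_380,
}


def best_next_word(previous, candidates, bigram_counts):
    """Pick the candidate seen most often after `previous`."""
    return max(candidates, key=lambda w: bigram_counts.get((previous, w), 0))


choice = best_next_word("what", ["if", "of"], bigram_counts)
ratio = bigram_counts[("what", "if")] / bigram_counts[("what", "of")]

print(choice)
print(round(ratio, 1))
```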

Words are high-dimensional points

I understand problems best when I can picture them geometrically. My intuitive understanding of the disambiguation problem finally clicked when we had an insight: words are points in high-dimensional space, and typing is a search for those words! Skeptical? Let me explain.
Minuum is a single line, so tapping your finger on Minuum can be represented by one number. In the figure below, for instance, a tap on “q” could clock in between 0 and 0.04, and a tap on “p” between 0.98 and 1.

[Figure: a continuum of letters from 0.0 to 1.0]

A two-letter word consists of two taps, and so can be represented as a pair of numbers. The word “an”, typed perfectly, is represented as {0.06, 0.67}, and the word “if” as {0.83, 0.40}. The figure below shows the positions of some common two-letter words in this “word space”.

The exact same logic applies to longer words: “and” is {0.06, 0.67, 0.29}, “minuum” is {0.79, 0.83, 0.67, 0.71, 0.71, 0.79}. Above three dimensions, unfortunately, it’s much harder to visualize.
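Here is a small sketch of that mapping. The lookup table contains only the letter positions that can be read off the examples in this post, and the function name `word_to_point` is made up for illustration.

```python
# Letter positions quoted in the examples above (not the full layout).
LETTER_POSITION = {
    "a": 0.06, "d": 0.29, "f": 0.40, "n": 0.67,
    "u": 0.71, "m": 0.79, "i": 0.83,
}


def word_to_point(word):
    """Map a word to its point in word space: one coordinate per letter.

    Raises KeyError for letters missing from the partial table above.
    """
    return tuple(LETTER_POSITION[ch] for ch in word.lower())


print(word_to_point("an"))
print(word_to_point("and"))
print(word_to_point("minuum"))
```

A word with n letters becomes a point with n coordinates, which is why longer words live in higher-dimensional spaces.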

A user’s sequence of taps is also a point in this word space, which we can call the input point. The “closer” a word’s point is to the input point, the higher that word will score in the spatial term of the Bayesian equation above. Odds are, whatever you meant to type is “nearby” to what you actually typed in this space.
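One way to make “closer” precise is to go back to the simple Gaussian model. As an assumption for illustration (Minuum’s real spatial model is more complicated), suppose each of the n taps x_i is independent and Gaussian around its target letter position w_i, with a single shared width σ. Then the spatial term works out to:

```latex
\begin{aligned}
p(\text{taps} \mid \text{word})
  &= \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}
     \exp\!\left(-\frac{(x_i - w_i)^2}{2\sigma^2}\right) \\
  &= (2\pi\sigma^2)^{-n/2}
     \exp\!\left(-\frac{\lVert x - w \rVert^2}{2\sigma^2}\right)
\end{aligned}
```

Under these assumptions, for words of the same length the spatial term depends only on the ordinary Euclidean distance between the input point x and the word’s point w, and it shrinks as that distance grows.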

So let’s visualize some words!

We can generate a full map of the top two-letter words recommended by Minuum, based on any possible pair of input taps; here, more common words tend to end up with larger areas. By hovering over the graph, you can see what other words would be recommended as alternative candidates.

[Interactive figure, two panels: “Two-letter word predictions with no context” and “Two-letter word predictions with previous word ‘what’”]


Toggle the context button above to see what happens when we use a better language model to account for the user having previously typed the word “what”. Clearly, “if” is more likely and “in” is less likely to be recommended when we account for context, because “what if” is more common than “what of”, while “what in” is less common than “what I’m”.1

Of course, Minuum uses more context than just the previous word, and also learns your personal typing tendencies over time, so this picture is different for each user.

Statistical modelling for better interfaces

All this complexity allows Minuum to shed some constraints of conventional keyboards (working even as a one-row keyboard on a 1-inch screen!).

What does this show? That interfaces are better when they understand the user! Google Instant is awesome because it knows what you’re looking for after a couple of keystrokes. Siri would be impossible without complex language modeling. Minuum can simplify keyboards only by combining strong spatial and language models of real human input. If you’re dealing with a complex interface, consider how you can statistically model user behaviour to simplify the interaction required.

Want to try it out? Download Minuum for Android

1 Without context, the word “if” has a small area, dominated by the surrounding words “it” and “of”. This is a side-effect of using the QWERTY layout. If it weren’t for the learning curve involved, we could rearrange the keyboard to put the “i” and “o”, as well as the “f” and “t”, very far apart! We’ve actually done this: we have a paper coming out soon. Incidentally, this is also why the Dvorak keyboard layout is exactly the opposite of what you want in a highly ambiguous scenario; Dvorak places all the vowels adjacent to each other, significantly increasing ambiguity. Intuitively, it rearranges word space to put many common words right next to each other.