Exploring Emoji: The Quest for the Perfect Emoticon

Emoji are about more than just being cute or funny; they’re a way of expressing tone and mood, clarifying sarcasm, and softening harsh replies. People are always finding new ways to use emoji, forever in search of the perfect emoji for any given situation. Take “camping”, for example: you might already know about the “tent” emoji, but digging deeper reveals many more that capture the nuances of what you love most about camping:

The Perfect Emoji for Camping

We dove deep into hundreds of millions of messages containing emoji, to uncover the fundamentals of how emoji are used:

The fundamentals of emoji structure; this is how people use emoji!

The problem is that emoji keyboards are horrifically inconvenient. People don’t have time to browse through hundreds (soon thousands) of miscategorized emoji, and most people aren’t even aware of half the emoji they have available to them. Too many people limit themselves to the occasional single smiley face, and stop there.

So how can we bring the full potential of emoji to users who don’t have time to explore? How can we make emoji easier?

Minuum saves you from the pains of scrolling through miscategorized lists of emoji

With massive data processing of real emoji use (more detail to come in upcoming posts, I promise!), we’re building hyper-intelligent algorithms that give you the perfect set of emoji for every situation, immediately.

Minuum’s new emoji suggestions can be used as search, translation, and prediction, all wrapped up into the core typing experience. It’s about quick access to the emoji you need, but most importantly, it’s about serendipity. Whether you’re discovering a single emoji, or a set of emoji that work well together, we’ll constantly surprise you with emoji you didn’t even know you wanted.

Here are examples of some fun combinations created from the first version of Minuum’s smart emoji suggestions:

Example of Minuum Smart Emoji Predictions

As we constantly learn from how you use emoji, the palette of suggestions will evolve until you hopefully never need to dig through emoji categories again.

You can try for yourself: Minuum’s smart emoji suggestions are available today for Android on Google Play, and for beta testers on iOS (you can sign up for our iOS mailing list to be notified when this feature becomes publicly available in the next week or two). As always, we’d love to hear your feedback @minuum, and stay tuned for more in this series on emoji exploration.

Smart Emoji is the first of many innovations that we have planned here at Minuum as we continue our long journey of applying semantic analysis, natural language processing, and machine learning to make mobile input better.

Side By Side iPhone 6+ Nexus 5 with Minuum Emoji Predictions


Taps and Swipes: Intuition vs. Machine Learning in UX Design

The list of things you can do with a touchscreen reads like a manual for misbehaving in elementary school: tap, swipe, double-tap, pinch, pull, drag, flick, twist. Our natural affinity for manipulative gestures has helped to transition them from the schoolyard to everyday computing experiences.

But as these gestures become fundamental to mobile interaction, we take their subtle complexities for granted. The difference between a tap and a swipe seems obvious to a human observer, but the distinction isn’t necessarily as clear to a machine, and understanding why can help you create better mobile experiences.

Tap vs Swipe


We’ve thought a lot about taps and swipes when designing the Minuum keyboard, and have fallen in love with the simplicity of this case study in particular.

But let’s take a step back: what happens when you drag your finger along a screen? You get something like this:


Dragging along the screen triggers a set of data points


Touchscreen hardware races to calculate the position of your fingers about 50-200 times per second, reporting back a list of points describing the path that you took. Ideally, a tap on the screen generates a single data point, while a swipe generates a list of data points not unlike the diagram above. The problem? A tap isn’t always quite a tap, and a swipe isn’t always quite a swipe.


A real tap tends to look more like a smudge.


Taps and swipes both end up looking like smudges. This is a problem that we’ve especially had to deal with when designing our touchscreen keyboard, where tapping happens rapidly enough that accidental dragging of taps is inevitable. To top this all off, we also have to deal with:

a) differences between users

b) differences between one-handed/two-handed use

c) different environmental factors (bumpy roads, a night on the town)

To properly distinguish between taps and swipes, we’ll need a simple gesture classification system. We can stick with our intuition, or if we want to get fancy, we can bust out some machine learning algorithms. Let’s break down some approaches you could take.

Intuition, Data Diving, Machine Learning

1. Intuition (not quite enough)

The standard approach here is to simply choose some property of the gesture, like the displacement, velocity, or acceleration. Say we choose velocity – we then pick an arbitrary velocity threshold; above this threshold (fast), a gesture counts as a swipe, and below the threshold (slow) a gesture counts as a tap. We’ll then play around with the threshold until it “feels” right, and be done.

Depending on the ferocity with which you expect users to use your app, this approach may be sufficient for you. But if users will be doing any serious mix of tapping and swiping, this trial-and-error approach won’t quite cut it.
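
As a rough illustration of this intuition-only approach, here is a minimal Python sketch. The `(x, y, t)` sample format, the function name, and the 300 pixels-per-second threshold are all assumptions made for the example, not values from any real touchscreen API.

```python
from math import hypot

# Hypothetical threshold in pixels per second, tuned until it "feels" right.
SPEED_THRESHOLD = 300.0

def classify_by_speed(points, threshold=SPEED_THRESHOLD):
    """Classify a gesture from a list of (x, y, t) samples, with t in seconds."""
    (x0, y0, t0), (x1, y1, t1) = points[0], points[-1]
    duration = t1 - t0
    if duration <= 0:
        # A single sample (or no elapsed time) has no measurable speed.
        return "tap"
    # Average speed: straight-line distance from first to last sample over time.
    speed = hypot(x1 - x0, y1 - y0) / duration
    return "swipe" if speed > threshold else "tap"
```

Everything hinges on that one constant, which is exactly why this approach struggles once users mix taps and swipes heavily.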


2. Machine Learning (a bit much)

The machine learning approach involves feeding the computer big chunks of raw data, and having it automatically learn the distinction between different types of gestures.

Classification is a textbook problem in machine learning: you have two (or more) categories of things that you are measuring, and you need a way to automatically classify things into those categories. In our case, the things are touchscreen gestures, and the two categories are “taps” and “swipes.”

The advanced developer might be tempted to use linear discriminant analysis, or a host of other algorithms, but is this really necessary in this case?


3. Data Diving (juuuust right)

The dirty secret of machine learning is that intuition plays a huge role. Understanding the data is the most important first step in any advanced approach. Even better, when our data is simple enough, visualizing that data to improve our intuition is really the only step that we need to take. No advanced computer science skills necessary.

Let’s choose two features of our user data which we think might help us to distinguish between taps and swipes: total displacement, and final velocity. We expect swipes to be both longer and faster than taps. We can test this by plotting every gesture as a single point on a graph, with displacement on one axis, and velocity on the other:


Gesture graph of tap and swipe data, plotted as speed vs. distance


We have two main clumps of data that emerge: the clump at the bottom left accounts for the taps (both slow and short), and the big mess in the middle of the graph accounts for all of the variably fast and long swipe gestures.
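
If you want to build this kind of graph from your own data, the two features plotted above could be computed along these lines. This is a sketch that assumes each gesture is a list of `(x, y, t)` samples (a made-up format) and that "final velocity" means the speed between the last two samples.

```python
from math import hypot

def gesture_features(points):
    """Return (total_displacement, final_speed) for a list of (x, y, t) samples."""
    if len(points) < 2:
        # A perfect tap: one data point, no movement at all.
        return 0.0, 0.0
    x0, y0, _ = points[0]
    xa, ya, ta = points[-2]
    xb, yb, tb = points[-1]
    # Straight-line distance between the first and last sample.
    displacement = hypot(xb - x0, yb - y0)
    # Speed over the final pair of samples (assumed definition of final velocity).
    dt = tb - ta
    final_speed = hypot(xb - xa, yb - ya) / dt if dt > 0 else 0.0
    return displacement, final_speed
```

Each gesture then becomes one `(displacement, speed)` pair, ready to be dropped onto a scatter plot with whatever charting tool you prefer.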

Creating this kind of graph is a) my favourite thing to do, b) something I highly recommend to every UX designer/developer/researcher, and c) something that I suspect not enough people actually get around to doing.

But how do we use this graph to properly define the difference between taps and swipes? Let’s first draw the simple threshold that we intuitively guessed, choosing a speed as the dividing line.


Gesture graph of tap and swipe data, plotted as speed vs. distance


In the above figure, our speed threshold does a pretty good job of separating the two clusters, but we can see that we’re accidentally treating way too many points at the bottom-middle and bottom-right as belonging to the “tap” cluster, when they clearly represent long strokes which were intended as swipes.

We could instead choose a displacement threshold, as shown below:


Gesture graph of tap and swipe data, plotted as speed vs. distance


This does a better job, but leaves us wishing that some of those very fast but short gestures on the left-hand side could be counted as swipe gestures.

The beautiful aspect of visualizing our data like this is that we don’t have to constrain ourselves to these boring thresholds; instead we can choose an arbitrary line. So let’s put it at an angle where it feels best:


Gesture graph of tap and swipe data, plotted as speed vs. distance


This line means that any gesture that appears to be fast and short, or slow and long, is clearly a swipe; a gesture that is slow and short is considered a tap. By choosing our diagonal threshold according to two different features of the data (speed and displacement), we were able to find a more complete classifier that prevents fringe swipes from being erroneously detected as taps.
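
Once you have picked a line by eye, turning it into a classifier takes only a few lines. The sketch below assumes the line is described by the two points where it crosses the axes; both intercept values are hypothetical and would be read off your own plot.

```python
# Hypothetical intercepts of the diagonal line, read off the plot by eye.
MAX_TAP_DISTANCE = 40.0   # where the line crosses the distance axis (pixels)
MAX_TAP_SPEED = 400.0     # where the line crosses the speed axis (pixels/second)

def classify(distance, speed):
    """A gesture is a tap only if it falls below the diagonal line."""
    # Points on the line satisfy distance/MAX_TAP_DISTANCE + speed/MAX_TAP_SPEED == 1.
    score = distance / MAX_TAP_DISTANCE + speed / MAX_TAP_SPEED
    return "tap" if score < 1.0 else "swipe"
```

With these numbers, `classify(5, 50)` is a tap, while a fast and short gesture such as `classify(10, 390)` and a slow and long one such as `classify(39, 50)` both land on the swipe side of the line.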

The process of choosing this line is what a machine learning algorithm would have done for us, but our data was simple enough that we didn’t need to bother with writing any code. And of course, we only knew that our data was simple because we bothered to visualize it. Whatever route you take, just don’t forget to take a good look at the data first. Only then should you decide for yourself how aggressively you want to dig into it, and how much you need your algorithms to get involved.


The Next Level

If you really want to go a step further, the natural next step is to have the system adapt to each user on the fly. That’s when you’ll need to bring machine learning into the loop; the algorithms will dynamically move this threshold line around as they learn more about the user’s fluctuating clusters of points. Just don’t forget: you’ll never set those algorithms up properly unless you actually visualize some sample data yourself, first.
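
To make "moving the threshold line around" a little more concrete, here is one very simple way it could work: a nearest-centroid classifier whose cluster centres drift toward each user's gestures. This is a stand-in chosen for brevity, not a description of what Minuum does. It assumes features are already scaled to comparable ranges, and that some later signal (hypothetical here) confirms what each gesture really was.

```python
class AdaptiveGestureClassifier:
    """Nearest-centroid classifier whose centroids drift toward each user's data."""

    def __init__(self, tap_centroid, swipe_centroid, rate=0.1):
        self.centroids = {"tap": list(tap_centroid), "swipe": list(swipe_centroid)}
        self.rate = rate  # hypothetical learning rate: how fast we adapt

    def classify(self, features):
        """Return the label of the closest centroid (squared Euclidean distance)."""
        def sq_dist(label):
            return sum((f - c) ** 2 for f, c in zip(features, self.centroids[label]))
        return min(self.centroids, key=sq_dist)

    def update(self, features, label):
        """Move the centroid for a confirmed label a small step toward the sample."""
        centroid = self.centroids[label]
        for i, f in enumerate(features):
            centroid[i] += self.rate * (f - centroid[i])
```

The dividing line here is implicit: it sits halfway between the two centroids, so every call to `update` nudges it a little for that user.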


Model Your Users: Algorithms Behind the Minuum Keyboard

When you’re creating a new keyboard technology, there’s a ton of work that goes into both the interaction design and the algorithms behind the scenes. While the design of our keyboard is best understood simply by using it, the real “magic” that makes our one-dimensional keyboard possible lies in the statistical algorithms that make it tick.

If you haven’t already seen or used the Minuum keyboard, the brief summary is that we let you compress the conventional keyboard down to just one row of keys, opening up the possibility of typing anywhere you can measure one dimension of input.


By shrinking the keyboard in this way we soon had to grapple with a basic fact: human input is imprecise, and the faster you type the more imprecise it gets. Rather than trying to improve user precision, we instead embrace sloppy typing.

This only works because we use disambiguation in addition to auto-correction. While “auto-correction” implies that you made a mistake that needed correcting, “disambiguation” accepts the fundamental ambiguity of human interaction, and uses an understanding of language to narrow things down. Think of it like speech recognition: in a noisy bar, the problem isn’t that your friends are speaking incorrectly; human speech is ambiguous, and the noisiness of the environment sure doesn’t help. You can only understand them because you have prior knowledge of the sorts of things they are likely to say.

Which leads us into the wonderful world of…

Bayesian statistics!

Minuum combines two factors to evaluate a word: a spatial model, which understands how precise you are when you tap on the keyboard (we perform user studies to measure this), and a language model, which understands what words you’re likely to use (we build this from huge bodies of real-world text). If you tap on the keyboard five times, and those taps somewhat resemble the word “hello”, we use the following Bayesian equation to test how likely it is that you wanted the word “hello”:

p(hello|taps) ∝ p(taps|hello) × p(hello)


Let’s break that equation down: the probability that you wanted the word “hello” given those taps, is proportional to the product of the spatial and language terms. The spatial term gives the likelihood that wanting to type the word “hello” could have led you to input that sequence of taps; the language term gives the probability that you would ever type the word “hello”.

Minuum’s job is to find the word that maximizes p(word|taps). In the example above, Minuum is generating a score for the word “hello”. To find the best word, Minuum would compare this score to the scores for other words, calculated the same way. The closer your taps are to the correct locations for a given word, the greater the spatial term for that word; the more common a word in English (or French, German, Italian or Spanish if you have one of those languages enabled) the greater the language term.
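
The comparison described above can be sketched in a few lines of Python. The candidate words and every probability below are made up purely for illustration, and `best_word` is a hypothetical helper, not part of Minuum.

```python
def best_word(spatial_likelihood, word_prior):
    """Return the candidate maximizing p(taps|word) * p(word), plus all scores.

    spatial_likelihood: dict of word -> p(taps|word) for the observed taps
    word_prior:         dict of word -> p(word)
    """
    scores = {
        word: likelihood * word_prior.get(word, 0.0)
        for word, likelihood in spatial_likelihood.items()
    }
    return max(scores, key=scores.get), scores

# Made-up numbers for illustration only.
spatial = {"hello": 0.020, "jello": 0.025, "hells": 0.010}
prior = {"hello": 1e-4, "jello": 1e-6, "hells": 2e-6}
word, scores = best_word(spatial, prior)
```

In this toy case the taps are actually a slightly better spatial match for “jello”, but “hello” still wins because it is so much more common.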

A simple spatial model

Minuum uses a fairly complicated spatial model (remember the spatial model represents how people tend to actually type on the keyboard). This model can handle many kinds of imprecision, such as extra and omitted characters. A simple model that works surprisingly well, however, is to treat the probability density of a tap as a Gaussian centered at the target character.

This shows that if you mean to type a “t”, the most likely point you tap on the keyboard is right on the “t”, but there is still a significant probability that you tap on a nearby location closer to the “v” or the “g”.
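
Here is a sketch of that simple Gaussian model, with tap positions measured as a fraction of the keyboard’s width. The spread `SIGMA` is a hypothetical value, and multiplying per-tap densities assumes each tap is independent of the others, which is a simplification.

```python
from math import exp, pi, sqrt

SIGMA = 0.05  # hypothetical tap spread, as a fraction of keyboard width

def tap_density(tap, target, sigma=SIGMA):
    """Gaussian probability density of tapping at `tap` when aiming for `target`."""
    z = (tap - target) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))

def spatial_likelihood(taps, targets, sigma=SIGMA):
    """p(taps|word), assuming every tap is independent of the others."""
    if len(taps) != len(targets):
        return 0.0  # this simple model cannot handle extra or omitted characters
    likelihood = 1.0
    for tap, target in zip(taps, targets):
        likelihood *= tap_density(tap, target, sigma)
    return likelihood
```

The density peaks when the tap lands exactly on the target and falls off smoothly on either side, so a near miss still contributes a meaningful score.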

A simple language model

The simplest language model is just a count of word frequencies. Take a large body of text (a corpus), and count how many times each word shows up.

Word    Frequency
if      1,115,786
IV      5,335

Comparing two potential words, say “if” and “IV”, the table above shows that “if” is around 200 times more likely to be typed than “IV”.
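
Counting word frequencies needs nothing beyond Python’s standard library. In this sketch the function names are our own, and splitting on whitespace is a simplifying assumption; the two counts are the ones from the table above.

```python
from collections import Counter

def word_frequencies(corpus_text):
    """Count how often each whitespace-separated token appears in a corpus."""
    return Counter(corpus_text.split())

def frequency_ratio(counts, word_a, word_b):
    """How many times more frequent word_a is than word_b."""
    return counts[word_a] / counts[word_b]

# Counts taken from the table above.
counts = Counter({"if": 1_115_786, "IV": 5_335})
ratio = frequency_ratio(counts, "if", "IV")  # a little over 209
```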

This simple model, like the simple spatial model, works quite well in practice. Further improvements can come from using context such as the word immediately before the current entry.

Word(s)    Frequency
what if    13,207
what of    1,380

The phrase “what if” is about ten times more common than “what of”, so even though “if” and “of” are both very common words, given the context “what”, we can confidently guess that “if” is the intended word.
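
The same counting trick extends to pairs of words. This sketch counts adjacent pairs and then ranks candidates by how often each one follows the previous word; the helper names are hypothetical, and the two counts come from the table above.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))

def rank_with_context(previous, candidates, bigrams):
    """Order candidates by how often each one follows `previous`."""
    return sorted(candidates, key=lambda word: bigrams[(previous, word)], reverse=True)

# Counts taken from the table above.
bigrams = Counter({("what", "if"): 13_207, ("what", "of"): 1_380})
ranked = rank_with_context("what", ["of", "if"], bigrams)
```

Here `ranked` puts “if” ahead of “of”, matching the guess we made by reading the table.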

Words are high-dimensional points

I understand problems best when I can picture them geometrically. My intuitive understanding of the disambiguation problem finally clicked when we had an insight: words are points in high-dimensional space, and typing is a search for those words! Skeptical? Let me explain.

Minuum is a single line, so tapping your finger on Minuum can be represented by one number. In the figure below, for instance, a tap on “q” could clock in between 0 and 0.04, and a tap on “p” between 0.98 and 1.


A continuum of letters from 0.0 to 1.0

A two-letter word consists of two taps, and so can be represented as a pair of numbers. The word “an”, typed perfectly, is represented as {0.06, 0.67}, and the word “if” as {0.83, 0.40}. The figure below shows the positions of some common 2-letter words in this “word space”.

The exact same logic applies to longer words: “and” is {0.06, 0.67, 0.29}, “minuum” is {0.79, 0.83, 0.67, 0.71, 0.71, 0.79}. Above three dimensions, unfortunately, it’s much harder to visualize.

A user’s sequence of taps is also a point in this word space, which we can call the input point. The “closer” a word’s point is to the input point, the higher that word will score in the spatial term of the Bayesian equation above. Odds are, whatever you meant to type is “nearby” to what you actually typed in this space.
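
To play with this idea, the sketch below turns words into points using only the letter positions quoted above, then sorts a small vocabulary by straight-line distance to the input point. The vocabulary and the example taps are made up. Under the simple Gaussian model with the same spread for every key, ranking by this distance gives the same order as ranking by the spatial term.

```python
from math import dist

# Letter positions quoted in the text (0.0 = left edge, 1.0 = right edge).
POSITION = {"a": 0.06, "n": 0.67, "d": 0.29, "i": 0.83, "f": 0.40, "m": 0.79, "u": 0.71}

def word_point(word):
    """Map a word to its point in word space, one coordinate per letter."""
    return tuple(POSITION[ch] for ch in word)

def nearest_words(taps, vocabulary):
    """Sort same-length words by Euclidean distance to the input point."""
    same_length = [w for w in vocabulary if len(w) == len(taps)]
    return sorted(same_length, key=lambda w: dist(word_point(w), taps))

# A sloppy attempt at "if": both taps are a little off target.
candidates = nearest_words((0.80, 0.45), ["an", "if", "in", "and"])
```

Here `candidates` starts with “if”, followed by “in” and then “an”; “and” is skipped because it lives in a three-dimensional space of its own.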

So let’s visualize some words!

We can generate a full map of the top two-letter words recommended by Minuum, based on any possible pair of input taps; here, more common words tend to end up with larger areas. By hovering over the graph, you can see what other words would be recommended as alternative candidates.

Two-letter predictions with no context
Two-letter word predictions with previous word “what”

Toggle the context button above to see what happens when we use a better language model to account for the user having previously typed the word “what”. Clearly, “if” is more likely and “in” is less likely to be recommended when we account for context, because “what if” is more common than “what of”, while “what in” is less common than “what I’m”.1

Of course, Minuum uses more context than just the previous word, and also learns your personal typing tendencies over time, so this picture is different for each user.

Statistical modelling for better interfaces

All this complexity allows Minuum to shed some constraints of conventional keyboards (working even as a one-row keyboard on a 1” screen!).

What does this show? That interfaces are better when they understand the user! Google Instant is awesome because it knows what you’re looking for after a couple keystrokes. Siri would be impossible without complex language modeling. Minuum can simplify keyboards only by combining strong spatial and language models of real human input. If you’re dealing with a complex interface, consider how you can statistically model user behaviour to simplify the interaction required.

Want to try it out? Download Minuum for Android
1 Without context, the word “if” has a small area, dominated by the surrounding words “it” and “of”. This is a side-effect of using the QWERTY layout. If it weren’t for the learning curve involved, we could rearrange the keyboard to put the “i” and “o”, as well as the “f” and “t”, very far apart! We’ve actually done this: we have a paper coming out soon. Incidentally, this is also why the Dvorak keyboard layout is exactly the opposite of what you want in a highly ambiguous scenario; Dvorak places all the vowels adjacent to each other, significantly increasing ambiguity. Intuitively it rearranges word-space to put many common words right next to each other.