Taps and Swipes: Intuition vs. Machine Learning in UX Design

The list of things you can do with a touchscreen reads like a manual for misbehaving in elementary school: tap, swipe, double-tap, pinch, pull, drag, flick, twist. Our natural affinity for manipulative gestures has helped to transition them from the schoolyard to everyday computing experiences.

But as these gestures become fundamental to mobile interaction, we take their subtle complexities for granted. The difference between a tap and a swipe seems obvious to a human observer, but the distinction isn’t necessarily as clear to a machine, and understanding why can help you create better mobile experiences.

Tap vs Swipe


We’ve thought a lot about taps and swipes when designing the Minuum keyboard, and have fallen in love with the simplicity of this case study in particular.

But let’s take a step back: what happens when you drag your finger along a screen? You get something like this:


Dragging along the screen triggers a set of data points


Touchscreen hardware races to calculate the position of your fingers about 50-200 times per second, reporting back a list of points describing the path that you took. Ideally, a tap on the screen generates a single data point, while a swipe generates a list of data points not unlike the diagram above. The problem? A tap isn’t always quite a tap, and a swipe isn’t always quite a swipe.


A real tap tends to look more like a smudge.


In practice, taps and swipes both end up looking like smudges. This is a problem that we’ve especially had to deal with when designing our touchscreen keyboard, where tapping happens rapidly enough that accidentally dragging a tap is inevitable. To top this all off, we also have to deal with:

a) differences between users

b) differences between one-handed/two-handed use

c) different environmental factors (bumpy roads, a night on the town)

To properly distinguish between taps and swipes, we’ll need a simple gesture classification system. We can stick with our intuition, or if we want to get fancy, we can bust out some machine learning algorithms. Let’s break down some approaches you could take.

Intuition, Data Diving, Machine Learning

1. Intuition (not quite enough)

The standard approach here is to simply choose some property of the gesture, like the displacement, velocity, or acceleration. Say we choose velocity – we then pick an arbitrary velocity threshold; above this threshold (fast), a gesture counts as a swipe, and below the threshold (slow) a gesture counts as a tap. We’ll then play around with the threshold until it “feels” right, and be done.
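As a rough sketch, the whole single-threshold approach fits in a few lines of Python. The threshold value and the function name here are made up for illustration; in practice you would tune the number by feel, exactly as described above.

```python
# Hypothetical starting value in pixels per second, to be tuned until it "feels" right.
SWIPE_VELOCITY_THRESHOLD = 300.0


def classify_by_velocity(velocity, threshold=SWIPE_VELOCITY_THRESHOLD):
    """Label a gesture using nothing but its velocity."""
    return "swipe" if velocity > threshold else "tap"


print(classify_by_velocity(50.0))   # slow gesture
print(classify_by_velocity(900.0))  # fast gesture
print(classify_by_velocity(120.0))  # a long but slow drag gets the same label as a slow tap
```

Note that the function never sees how far the finger travelled, so a long, slow drag is indistinguishable from a tap.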

Depending on the ferocity with which you expect users to use your app, this approach may be sufficient for you. But if users will be doing any serious mix of tapping and swiping, this trial-and-error approach won’t quite cut it.


2. Machine Learning (a bit much)

The machine learning approach involves feeding the computer a large set of recorded gestures and having it learn the distinction between the different gesture types automatically.

Classification is a textbook problem in machine learning: you have two (or more) categories of things that you are measuring, and you need a way to automatically classify things into those categories. In our case, the things are touchscreen gestures, and the two categories are “taps” and “swipes.”

The advanced developer might be tempted to use linear discriminant analysis, or a host of other algorithms, but is this really necessary in this case?
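If you’re curious what “letting the algorithm pick the line” looks like, here is a minimal sketch of a two-class Fisher linear discriminant in plain Python. The sample points are invented for illustration (they are not real Minuum data), the function names are our own, and a real project would more likely reach for an existing library.

```python
def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, m):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fit_lda(taps, swipes):
    """Return (w, bias) so that w . x + bias > 0 means 'swipe'."""
    m_tap, m_swipe = mean(taps), mean(swipes)
    a1, b1, c1 = scatter(taps, m_tap)
    a2, b2, c2 = scatter(swipes, m_swipe)
    sxx, sxy, syy = a1 + a2, b1 + b2, c1 + c2  # within-class scatter
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("scatter matrix is singular; need more varied data")
    dx, dy = m_swipe[0] - m_tap[0], m_swipe[1] - m_tap[1]
    # w = inverse(within-class scatter) * (m_swipe - m_tap)
    w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
    # Put the boundary through the midpoint between the two class means.
    mid = ((m_tap[0] + m_swipe[0]) / 2, (m_tap[1] + m_swipe[1]) / 2)
    bias = -(w[0] * mid[0] + w[1] * mid[1])
    return w, bias


def classify(w, bias, displacement, velocity):
    return "swipe" if w[0] * displacement + w[1] * velocity + bias > 0 else "tap"


# Made-up (displacement in px, velocity in px/s) samples, labelled by hand.
taps = [(2, 20), (5, 60), (8, 40), (4, 90)]
swipes = [(120, 900), (200, 400), (80, 1500), (160, 1100)]

w, bias = fit_lda(taps, swipes)
print(classify(w, bias, 5, 50))    # short and slow
print(classify(w, bias, 150, 100)) # long and slow
```

The result is just a straight line through the feature space, which is worth keeping in mind for the next section.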


3. Data Diving (juuuust right)

The dirty secret of machine learning is that intuition plays a huge role. Understanding the data is the most important first step in any advanced approach. Even better, when our data is simple enough, visualizing that data to improve our intuition is really the only step that we need to take. No advanced computer science skills necessary.

Let’s choose two features of our user data which we think might help us to distinguish between taps and swipes: total displacement, and final velocity. We expect swipes to be both longer and faster than taps. We can test this by plotting every gesture as a single point on a graph, with displacement on one axis, and velocity on the other:


Gesture graph of tap and swipe data, plotted as speed vs. distance


We have two main clumps of data that emerge: the clump at the bottom left accounts for the taps (both slow and short), and the big mess in the middle of the graph accounts for all of the variably fast and long swipe gestures.

Creating this kind of graph is a) my favourite thing to do, b) something I highly recommend to every UX designer/developer/researcher, and c) something that I suspect not enough people actually get around to doing.
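Before you can plot anything, each raw gesture has to be boiled down to those two numbers. Here is one way that might look, assuming each gesture arrives as a list of (x, y, t) samples with t in seconds. That sample format, the function name, and the choice to measure final velocity over the last two samples are all our assumptions for this sketch.

```python
import math


def gesture_features(samples):
    """Return (total displacement, final velocity) for one gesture.

    samples: list of (x, y, t) tuples in the order the touchscreen reported them.
    """
    if not samples:
        raise ValueError("a gesture needs at least one sample")
    x0, y0, _ = samples[0]
    x1, y1, t1 = samples[-1]
    # Straight-line distance from the first touch point to the last.
    displacement = math.hypot(x1 - x0, y1 - y0)
    if len(samples) < 2:
        return displacement, 0.0  # an ideal single-point tap
    # Speed over the final pair of samples.
    xp, yp, tp = samples[-2]
    dt = t1 - tp
    final_velocity = math.hypot(x1 - xp, y1 - yp) / dt if dt > 0 else 0.0
    return displacement, final_velocity


ideal_tap = [(100, 200, 0.0)]
swipe = [(0, 0, 0.00), (30, 40, 0.01), (60, 80, 0.02)]
print(gesture_features(ideal_tap))
print(gesture_features(swipe))
```

Each gesture then becomes a single (displacement, velocity) point that you can hand to whatever plotting tool you like.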

But how do we use this graph to properly define the difference between taps and swipes? Let’s first draw the simple threshold that we intuitively guessed, choosing a speed as the dividing line.


Gesture graph of tap and swipe data, plotted as speed vs. distance


In the above figure, our speed threshold does a pretty good job of separating the two clusters, but we can see that we’re accidentally treating way too many points at the bottom-middle and bottom-right as belonging to the “tap” cluster, when they clearly represent long strokes which were intended as swipes.

We could instead choose a displacement threshold, as shown below:


Gesture graph of tap and swipe data, plotted as speed vs. distance


This does a better job, but leaves us wishing that some of those very fast but short gestures on the left-hand side could be counted as swipe gestures.

The beautiful aspect of visualizing our data like this is that we don’t have to constrain ourselves to these boring horizontal or vertical thresholds; instead, we can draw any line we like. So let’s put it at an angle where it feels best:


Gesture graph of tap and swipe data, plotted as speed vs. distance


This line means that any gesture that is fast and short, slow and long, or both fast and long is treated as a swipe; only a gesture that is both slow and short is considered a tap. By choosing our diagonal threshold according to two different features of the data (speed and displacement), we get a more complete classifier that prevents fringe swipes from being erroneously detected as taps.

The process of choosing this line is what a machine learning algorithm would have done for us, but our data was simple enough that we didn’t need to bother with writing any code. And of course, we only knew that our data was simple because we bothered to visualize it. Whatever route you take, just don’t forget to take a good look at the data first. Only then should you decide for yourself how aggressively you want to dig into it, and how much you need your algorithms to get involved.
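Once the line has been chosen by eye, applying it in an app is nearly a one-liner. A sketch, with hypothetical intercepts standing in for wherever your own line crosses the two axes:

```python
# Hypothetical values read off the graph: where the diagonal line
# crosses the displacement axis and the speed axis.
DISPLACEMENT_INTERCEPT = 40.0  # px
SPEED_INTERCEPT = 600.0        # px per second


def classify_gesture(displacement, speed):
    """Anything above the diagonal line is a swipe; anything below is a tap."""
    score = displacement / DISPLACEMENT_INTERCEPT + speed / SPEED_INTERCEPT
    return "swipe" if score > 1.0 else "tap"


print(classify_gesture(5, 50))     # slow and short
print(classify_gesture(10, 800))   # fast and short
print(classify_gesture(120, 100))  # slow and long
```

Moving the line is then just a matter of editing the two intercepts and looking at the graph again.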


The Next Level

If you really want to go a step further, the natural next step is to have the system adapt to each user on the fly. That’s when you’ll need to bring machine learning into the loop; the algorithms will dynamically move this threshold line around as they learn more about the user’s fluctuating clusters of points. Just don’t forget – you’ll never set those algorithms up properly unless you actually visualize some sample data yourself, first.
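To make “moving the line around” a little more concrete, here is one very simple way it could work; this is an illustration, not a description of what Minuum actually does. The sketch keeps a running centre for each cluster, labels a gesture by whichever centre is nearer, and nudges that centre toward each new gesture. The class name, starting centres, scale factors, and learning rate are all assumptions.

```python
class AdaptiveTapSwipeClassifier:
    """Nearest-centre classifier whose centres drift toward the user's gestures."""

    def __init__(self, tap_centre, swipe_centre, scale=(100.0, 1000.0), rate=0.1):
        self.centres = {"tap": list(tap_centre), "swipe": list(swipe_centre)}
        self.scale = scale  # puts px and px/s on comparable footing
        self.rate = rate    # how quickly a centre follows new gestures

    def _distance_sq(self, label, point):
        centre = self.centres[label]
        return sum(((p - c) / s) ** 2 for p, c, s in zip(point, centre, self.scale))

    def classify(self, displacement, speed):
        point = (displacement, speed)
        return min(self.centres, key=lambda label: self._distance_sq(label, point))

    def update(self, displacement, speed):
        """Classify a gesture, then pull the chosen centre toward it."""
        label = self.classify(displacement, speed)
        centre = self.centres[label]
        for i, value in enumerate((displacement, speed)):
            centre[i] += self.rate * (value - centre[i])
        return label


clf = AdaptiveTapSwipeClassifier(tap_centre=(5, 50), swipe_centre=(150, 1000))
before = clf.classify(80, 540)
# A user whose taps are consistently smudgier than our starting guess.
for _ in range(20):
    clf.update(30, 200)
after = clf.classify(80, 540)
print(before, after)
```

The implied threshold here is the line halfway between the two centres, so as the tap centre drifts, the boundary drifts with it. A sketch like this trusts its own labels, which is exactly the kind of behaviour you would want to check against a plot of real data before shipping.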