I am trying to solve the problem of matching a human generated gesture with a known gesture. The human generated gesture will be represented by a sequence of points that will need to be interpolated into a path and compared to the existing path. The image below shows what I am trying to compare
Can you please help point me in the right direction with resources or concepts that I can read into to build an algorithm to match these two paths? I have no experience in doing this before so any insights will be appreciated.
Measure input on some interval. Every xx milliseconds, measure the coordinates of the user's hand/finger/stylus.
Modify the pattern. It's currently a continuous "function," but measuring input as such is difficult. Use discrete points at some interval. This interval can be very short, depending on how accurate you require gestures to be. In fact, it should be very short; the more points to compare against, the better (I'll explain this a little better in the next section).
When input is measured, the input-measurement interval needs to be short enough that each received consecutive pair of input points is close enough to compare to the expected points.
Imagine that the user performs some gesture very quickly (and completes it in the time your input-reader reads only three frames). The pattern and input cannot be reliably compared:
To avoid this, your input-reader must have a relatively short interval. However, this probably isn't a huge concern, since most hardware can read even the fastest human gestures.
Back to patterns: they should always be detailed enough to include more points than any possible input. More expected points allow for better accuracy. If a user moves slowly, the input will have more points; if they move quickly, the input will have fewer.
Consider this: completing a single gesture gives you half as many input frames as the pattern includes. The user has moved at a "normal" speed, so, to simplify the algorithm, you can "dumb down" your pattern by a factor of 2, then compare input coordinates to pattern coordinates directly.
This method is easier than the alternative that comes to mind (see next section).
If you have a small number of expected points, you'll have to make approximations to match input.
Here's an "extreme" example, but it proves the concept. Given this pattern and input:
Point 3r can't be reliably compared with point 2 or point 3, so you'd have to use some function of points 2, 3, and 3r, to determine if 3r is on the correct path. Now consider the same input, but where the pattern has higher density:
Now, you don't have to compromise, since 3r is essentially definitely on the gesture's pattern. A slight reduction in the pattern's density causes it to match input quite well.
Instead of comparing absolute positions (such as on a touchscreen), you probably want the gesture to be allowed anywhere in some plane of space. To that end, you must relate the start point of the input to some coordinate system.
To be user-friendly, allow gestures to be done in a range of "sizes". You don't want to compare raw data, because chances are the size of the plane of the input doesn't match the size of the plane of the pattern.
Normalize the input in the x- and y-direction to match the size of your pattern. Do not maintain aspect ratio.
RecMaxH
and RecMaxV
)ExpMaxH
and ExpMaxV
)ExpMaxH/RecMaxH
ExpMaxV/RecMaxV
You now have two more-similar sets of points that can be compared. Normalization can be much more detailed than this; for instance, you could normalize sets of 3 points at a time to get incredibly similar images (but you would probably have to do this for each pattern, then compare the sum of all differences to find the most likely matching pattern).
I suggest storing all gestures' pattern as a graph the same size; that reduces computation when measuring closeness of input to possible pattern matches.
Imagine a button that, when clicked/activated, causes your program to begin measuring inputs. This would be similar to Google's Voice Search, which doesn't constantly record and search; instead, you say "Ok Jarvis" or click the handy microphone icon and begin speaking your query.
Benefits:
Drawbacks:
If you're writing, for instance, a gesture-search (ridiculous example), this is probably the better method to implement. Nobody wants every move they make interpreted as an action in your application. However, if you're writing a Kinect-style or gesture-based game, you probably want to be constantly recording and looking for gestures.
Your program constatly records gesture coordinates at the specified interval (this could be reduced to "records if there's movement, otherwise doesn't store coordinates"). You must make a decision: how many "frames" will you record until deciding that the currently-stored motion is not a recognized gesture?
Store coordinates in a buffer: a queue 1.5 or 2 (to be cautious) times as long as the largest number of frames you're willing to record.
Once you determine that there exists in this buffer a sequence of frames that match a pattern, execute that gesture's result, and clear the queue.
If there's the possibility that the next gesture is an "option" for the most-recent gesture, record the application state as "currently waiting on option for ____ gesture," and wait for the option to appear.
If it's determined that the first x frames in the buffer cannot possibly match a pattern (because of their sequence or positioning), remove them from the queue.
Benefits:
Drawbacks:
If you're writing a game that runs based on real-time input, this is probably the right choice.
If you're using user-driven recognition:
If you're using constant measuring:
In your buffer, treat the sequence of max_sequence_size (you decide) beginning at every multiple of frame_multiples (you decide) as a possible gesture. For instance, if all of my possible gestures are at most 20 frames long, and I believe every 5 frames a new gesture could be starting (and I won't lose any critical data in those 5 frames), I'll compare each portions of the buffer to all possible gestures (portions from 0-19, 5-24, 10-29, etc.). This is heavier computing when frame_multiples decreases. For perfect measurement, frame_multiples is 1 (but this is likely not reasonable).
Hope you've enjoyed reading this answer as much as I enjoyed writing it. I've never done this before, but you've piqued my interest in a way that doesn't often happen. Please edit and improve my answer! If there's a portion that seems incomplete, add to it. I'm very curious in (particularly, more-experienced) responses and criticism.