Search code examples
algorithmgesturegesture-recognition

Gesture detection algorithm based on discrete points


I am trying to solve the problem of matching a human generated gesture with a known gesture. The human generated gesture will be represented by a sequence of points that will need to be interpolated into a path and compared to the existing path. The image below shows what I am trying to compare enter image description here

Can you please help point me in the right direction with resources or concepts that I can read into to build an algorithm to match these two paths? I have no experience in doing this before so any insights will be appreciated.


Solution

  • Receiving input

    Measure input on some interval. Every xx milliseconds, measure the coordinates of the user's hand/finger/stylus.


    Storing patterns and input

    Patterns (expected input)

    Modify the pattern. It's currently a continuous "function," but measuring input as such is difficult. Use discrete points at some interval. This interval can be very short, depending on how accurate you require gestures to be. In fact, it should be very short; the more points to compare against, the better (I'll explain this a little better in the next section).

    Input (received from user)

    When input is measured, the input-measurement interval needs to be short enough that each received consecutive pair of input points is close enough to compare to the expected points.

    Imagine that the user performs some gesture very quickly (and completes it in the time your input-reader reads only three frames). The pattern and input cannot be reliably compared:

    Input too fast for reader

    To avoid this, your input-reader must have a relatively short interval. However, this probably isn't a huge concern, since most hardware can read even the fastest human gestures.

    Back to patterns: they should always be detailed enough to include more points than any possible input. More expected points allow for better accuracy. If a user moves slowly, the input will have more points; if they move quickly, the input will have fewer.

    Consider this: completing a single gesture gives you half as many input frames as the pattern includes. The user has moved at a "normal" speed, so, to simplify the algorithm, you can "dumb down" your pattern by a factor of 2, then compare input coordinates to pattern coordinates directly.

    This method is easier than the alternative that comes to mind (see next section).


    Pattern "density" (coordinate frequency)

    If you have a small number of expected points, you'll have to make approximations to match input.

    Here's an "extreme" example, but it proves the concept. Given this pattern and input:

    Input points don't match pattern points

    Point 3r can't be reliably compared with point 2 or point 3, so you'd have to use some function of points 2, 3, and 3r, to determine if 3r is on the correct path. Now consider the same input, but where the pattern has higher density:

    Pattern more dense than input

    Now, you don't have to compromise, since 3r is essentially definitely on the gesture's pattern. A slight reduction in the pattern's density causes it to match input quite well.


    Positioning

    Relative positioning

    Instead of comparing absolute positions (such as on a touchscreen), you probably want the gesture to be allowed anywhere in some plane of space. To that end, you must relate the start point of the input to some coordinate system.

    Relate input to coordinate system

    Normalization

    To be user-friendly, allow gestures to be done in a range of "sizes". You don't want to compare raw data, because chances are the size of the plane of the input doesn't match the size of the plane of the pattern.

    Normalize the input in the x- and y-direction to match the size of your pattern. Do not maintain aspect ratio.

    1. Relate the input to a coordinate system, as per previous bullet
    2. Find the largest horizontal and vertical distance between any two input points (call them RecMaxH and RecMaxV)
    3. Find the largest horizontal and vertical distance between any two pattern points (call them ExpMaxH and ExpMaxV)
    4. Multiply all input points' x-coordinates by ExpMaxH/RecMaxH
    5. Multiple all input points' y-coordinates by ExpMaxV/RecMaxV

    Normalize input to match size of pattern

    You now have two more-similar sets of points that can be compared. Normalization can be much more detailed than this; for instance, you could normalize sets of 3 points at a time to get incredibly similar images (but you would probably have to do this for each pattern, then compare the sum of all differences to find the most likely matching pattern).

    I suggest storing all gestures' pattern as a graph the same size; that reduces computation when measuring closeness of input to possible pattern matches.


    When to measure input

    User-driven

    Imagine a button that, when clicked/activated, causes your program to begin measuring inputs. This would be similar to Google's Voice Search, which doesn't constantly record and search; instead, you say "Ok Jarvis" or click the handy microphone icon and begin speaking your query.

    Benefits:

    • Simplifies algorithm
    • Prevents user from unintentionally triggering an event. Imagine if every word you spoke was analyzed and sent to Google as part of a search query. Sometimes you just don't mean to do anything.

    Drawbacks:

    • Less user-friendly. User must go out of his/her way to trigger recording for gestures.

    If you're writing, for instance, a gesture-search (ridiculous example), this is probably the better method to implement. Nobody wants every move they make interpreted as an action in your application. However, if you're writing a Kinect-style or gesture-based game, you probably want to be constantly recording and looking for gestures.

    Constant

    Your program constatly records gesture coordinates at the specified interval (this could be reduced to "records if there's movement, otherwise doesn't store coordinates"). You must make a decision: how many "frames" will you record until deciding that the currently-stored motion is not a recognized gesture?

    Store coordinates in a buffer: a queue 1.5 or 2 (to be cautious) times as long as the largest number of frames you're willing to record.

    • Once you determine that there exists in this buffer a sequence of frames that match a pattern, execute that gesture's result, and clear the queue.

    • If there's the possibility that the next gesture is an "option" for the most-recent gesture, record the application state as "currently waiting on option for ____ gesture," and wait for the option to appear.

    • If it's determined that the first x frames in the buffer cannot possibly match a pattern (because of their sequence or positioning), remove them from the queue.

    Benefits:

    • Allows for more dynamic handling of gestures
    • User input recognized automatically

    Drawbacks:

    • More complex algorithm
    • Heavier computation

    If you're writing a game that runs based on real-time input, this is probably the right choice.


    Algorithm

    If you're using user-driven recognition:

    1. Record all input in the allowed timeframe (or until the user signifies that they're done)
    2. To evaluate the input, reduce the density of your pattern to match that of the input
    3. Relate the input to a coordinate system
    4. Normalize input
    5. Use a method of function comparison (looseness of this calculation is up to you: standard deviation, variance, total difference in values, etc.), and choose the least-different possibility.
    6. If no possibility is similar enough to meet your required threshold (you must decide this), don't accept the input.

    If you're using constant measuring:

    In your buffer, treat the sequence of max_sequence_size (you decide) beginning at every multiple of frame_multiples (you decide) as a possible gesture. For instance, if all of my possible gestures are at most 20 frames long, and I believe every 5 frames a new gesture could be starting (and I won't lose any critical data in those 5 frames), I'll compare each portions of the buffer to all possible gestures (portions from 0-19, 5-24, 10-29, etc.). This is heavier computing when frame_multiples decreases. For perfect measurement, frame_multiples is 1 (but this is likely not reasonable).


    Hope you've enjoyed reading this answer as much as I enjoyed writing it. I've never done this before, but you've piqued my interest in a way that doesn't often happen. Please edit and improve my answer! If there's a portion that seems incomplete, add to it. I'm very curious in (particularly, more-experienced) responses and criticism.