Search code examples
algorithmnlpaggregatededuplication

De-Duplicating sets of n-grams


I need to come up with a way to sort and display the most relevant data to users. Our data consists of multiple n-grams that are extracted from Social Media. We call these 'topics'.

The problem I am facing is that the data contains a lot of duplication. While each String is not a direct duplicate of another, they are sub-sets. To a User, this information appears duplicated. Here's some sample data:

{
    "count": 1.0, 
    "topic": "lazy people"
}, 
{
    "count": 1.0, 
    "topic": "lazy people taking"
}, 
{
    "count": 1.0, 
    "topic": "lazy people taking away food stamps"
}

An edge case is that the phrase "lazy people" can be extracted from other phrases. For example, "lazy people are happy". Using the smallest common denominator ("lazy people" in this case) does not seem like a good idea because the end-User would not be presented the different contexts ("taking away food stamps" and "are happy").

On the other hand, taking the longest N-Gram may be too much information. In the example I gave above, that seems logical. However, that may not always hold true.

My overall goal is to present this data in a way that is informative and ranked.

Are there any existing solutions and corresponding algorithms to solve this class of problems?

Note: Initially my question was extremely vague and unclear. In fact, that led me to changing the question all together because what I really need is guidance in what my end-result should be.

Note 2: Let me know if I've mis-used any terms or should modify the title of this question to enhance others searching for answers to this type of a question.


Solution

  • This is a hard problem and solutions tend to be very application specific. Typically you'd collect more than just the n-grams and counts. For example, it usually matters if a particular n-gram is used a lot by a single person, or by a lot of people. That is, if I'm a frequent poster and I'm passionate about wood carving, then the n-gram "wood carving" might show up as a common term. But I'm the only person who cares about it. On the other hand, there might be many people who are into oil painting, but they post relatively infrequently and so the count for the n-gram "oil painting" is close to the count for "wood carving." But it should be obvious that "oil painting" will be relevant to your users and "wood carving" would not be. Without information about what pages the n-grams come from, it's impossible to say which would be relevant to more users.

    A common way to identify the most relevant phrases across a corpus of documents is called TF-IDF: Term frequency-inverse document frequency. Most descriptions you see concern themselves with individual words, but it's simple enough to extend that to n-grams.

    This assumes, of course, that you can identify individual documents of some sort. You could consider each individual post as a document, or you could group all of the posts from a user as a larger document. Or maybe all of the posts from a single day are considered a document. How you identify documents is up to you.

    A simple TF-IDF model is not difficult to build and it gives okay results for a first cut. You can run it against a sample corpus to get a baseline performance number. Then you can add refinements (see the Wikipedia article and related pages), always testing their performance against your pure TF-IDF baseline.

    Given the information I have, that's where I would start.