python nlp nltk analysis

Is there a function that allows me to determine if a text talks about a predefined topic?


I want to write topic lists to check whether a review talks about one of the defined topics. It's important for me to write the topic lists myself and not use topic modeling to find possible topics.

I thought this was called dictionary analysis, but I can't find anything on it.

I have a data frame with reviews from Amazon:

import pandas as pd

df = pd.DataFrame({'User': ['UserA', 'UserB', 'UserC'],
'text': ['Example text where he talks about a phone and his charging cable',
 'Example text where he talks about a car with some wheels',
 'Example text where he talks about a plane']})

Now I want to define topic lists:

phone = ['phone', 'cable', 'charge', 'charging', 'call', 'telephone']
car = ['car', 'wheel','steering', 'seat','roof','other car related words']
plane = ['plane', 'wings', 'turbine', 'fly']

The result of the method should be 3/12 for the "phone" topic of the first review (3 words from the topic list were in the review, which has 12 words) and 0 for the other two topics.

The second review would result in 2/11 for the "car" topic and 0 for the other topics, and the third review in 1/8 for the "plane" topic and 0 for the others.

Results as a list:

phone_results = [0.25, 0, 0]
car_results = [0, 0.18181818182, 0]
plane_results = [0, 0, 0.125]

Of course, I would only use lowercase word stems of the reviews, which makes defining topics easier, but that is not a concern for now.

Is there a method for this or do I have to write one? Thank you in advance!


Solution

  • NLP can get quite deep, but if all you need is the ratio of known words, something more basic will probably do. For example:

    word_map = {
        'phone': ['phone', 'cable', 'charge', 'charging', 'call', 'telephone'],
        'car': ['car', 'wheels','steering', 'seat','roof','other car related words'],
        'plane': ['plane', 'wings', 'turbine', 'fly']
    }
    sentences = [
         'Example text where he talks about a phone and his charging cable',
         'Example text where he talks about a car with some wheels',
         'Example text where he talks about a plane'
    ]
    
    for sentence in sentences:
        print('==== %s ==== ' % sentence)
        words = sentence.split()
        for prefix in word_map:
            # Count the words in the sentence that appear in this topic's list
            match_score = 0
            for word in words:
                if word in word_map[prefix]:
                    match_score += 1
            # Share of the sentence's words that belong to the topic
            print('Prefix: %s | MatchScore: %.2f' % (prefix, match_score / len(words)))
    

    And you'd get something like this (Python 3):

    ==== Example text where he talks about a phone and his charging cable ==== 
    Prefix: phone | MatchScore: 0.25
    Prefix: car | MatchScore: 0.00
    Prefix: plane | MatchScore: 0.00
    ==== Example text where he talks about a car with some wheels ==== 
    Prefix: phone | MatchScore: 0.00
    Prefix: car | MatchScore: 0.18
    Prefix: plane | MatchScore: 0.00
    ==== Example text where he talks about a plane ==== 
    Prefix: phone | MatchScore: 0.00
    Prefix: car | MatchScore: 0.00
    Prefix: plane | MatchScore: 0.12
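
If you'd rather have the scores as one list per topic, as asked in the question, the same counting logic could be wrapped in a function. This is only a sketch: the name `topic_scores` is made up for illustration, and it still splits on whitespace and matches exact words, so it has the same limitations as the loop above.

```python
def topic_scores(texts, topics):
    """Return {topic: [score per text]}, where score = matching words / total words."""
    results = {topic: [] for topic in topics}
    for text in texts:
        words = text.split()
        for topic, keywords in topics.items():
            keyword_set = set(keywords)  # set lookup instead of scanning the list
            hits = sum(1 for word in words if word in keyword_set)
            # Guard against empty texts so we never divide by zero.
            results[topic].append(hits / len(words) if words else 0.0)
    return results


word_map = {
    'phone': ['phone', 'cable', 'charge', 'charging', 'call', 'telephone'],
    'car': ['car', 'wheels', 'steering', 'seat', 'roof'],
    'plane': ['plane', 'wings', 'turbine', 'fly'],
}
sentences = [
    'Example text where he talks about a phone and his charging cable',
    'Example text where he talks about a car with some wheels',
    'Example text where he talks about a plane',
]

results = topic_scores(sentences, word_map)
print(results['phone'])  # [0.25, 0.0, 0.0]
print(results['plane'])  # [0.0, 0.0, 0.125]
```

Since each list has one entry per review, it should be possible to assign it straight to a new column of the data frame from the question, e.g. `df['phone'] = results['phone']`.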
    

    This is only a basic example, of course. Words aren't always delimited by spaces; they can be followed by commas, periods, and so on, so you'd want to take that into account. Tense matters as well: I can "phone" someone, or have "phoned" them, or be "phoning" them, yet a word such as "phonetic" shouldn't get mixed up with those. So the edge cases get pretty tricky, but for a very basic working(!) example, I would see whether you can do it in plain Python without a natural language library. If that eventually doesn't meet your use case, you can start testing libraries out.
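
As a rough illustration of the punctuation and "phonetic" points, here is one way it might be handled with only the standard library. The helper names `tokenize` and `match_ratio` are hypothetical, and the regex is an assumption about what counts as a word (runs of ASCII letters, digits and apostrophes). Tense is not handled here: "phoned" only matches if you list it yourself or run both sides through a stemmer.

```python
import re


def tokenize(text):
    """Lowercase the text and return its words with punctuation stripped."""
    # Assumption: a "word" is a run of ASCII letters, digits or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def match_ratio(text, keywords):
    """Fraction of tokens in `text` that are exactly one of `keywords`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    keyword_set = {keyword.lower() for keyword in keywords}
    # Whole-token comparison: "phonetic" is not counted as "phone".
    hits = sum(1 for token in tokens if token in keyword_set)
    return hits / len(tokens)


phone = ['phone', 'cable', 'charge', 'charging', 'call', 'telephone']

# Tokens: my, phone, the, cable, phonetic -> 2 of 5 match
print(match_ratio("My phone, the cable. Phonetic!", phone))  # 0.4
```

With a plain `split()`, the same sentence would score 0, because "phone," and "cable." carry their punctuation along.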

    Beyond that you can look at something like Rasa NLU or nltk.