Tags: python, text, nlp, cluster-analysis, similarity

Grouping together text descriptions in Python


I have the following dataset:

import pandas as pd

data = pd.DataFrame({'Members': ['Biology PhD student', 'Chemistry Master student', 'Engineering undergraduate student', 'Administration staff',
                                 'Reception staff', 'Research Associate Chemistry', 'Associate Statistics'],
                     'UCode': [1, 1, 1, 2, 2, 1, 1],
                     'id': ['aaa100', 'aaa121', 'aa123', 'bb212', 'bb214', 'aa111', 'aa109']})

data

             Members                     UCode  id
    0   Biology PhD student                1    aaa100
    1   Chemistry Master student           1    aaa121
    2   Engineering undergraduate student  1    aa123
    3   Administration staff               2    bb212
    4   Reception staff                    2    bb214
    5   Research Associate Chemistry       1    aa111
    6   Associate Statistics               1    aa109

where the column data.Members contains strings describing the role of each listed member.

Which kind of text analysis would you suggest to find groups of similar members using only the text in the column data.Members? In this toy example, for instance, the analysis should return two distinct groups. I am thinking about a measure of similarity between two lists of strings/words. Any suggestions or help would be much appreciated. Thank you, Marco


Solution

  • A simple word counter, for instance:

    from collections import Counter

    # Count how often each word occurs across all member descriptions
    word_counter = Counter()
    for text in data['Members']:
        word_counter.update(text.split())

    # Show the three most frequent words
    print(word_counter.most_common(3))
    

    Output (Python 3.7+, where ties are listed in first-seen order; 'Associate' also occurs twice): [('student', 3), ('Chemistry', 2), ('staff', 2)]
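
  • Building on these counts, one possible way to form groups is to assign each description to the word in it that is most frequent overall. The following is only a minimal sketch, assuming the descriptions are held in a plain list called `members`; the helper `group_key` is a hypothetical name introduced for illustration.

```python
from collections import Counter, defaultdict

# Assumption: the descriptions from data['Members'], copied into a plain list
members = [
    'Biology PhD student', 'Chemistry Master student',
    'Engineering undergraduate student', 'Administration staff',
    'Reception staff', 'Research Associate Chemistry', 'Associate Statistics',
]

# Count every word across all descriptions
word_counter = Counter(word for text in members for word in text.split())


def group_key(text):
    """Return the word of `text` that is most frequent overall.

    Hypothetical helper: on ties, max() keeps the first such word in `text`.
    """
    return max(text.split(), key=lambda word: word_counter[word])


# Collect the descriptions under their group key
groups = defaultdict(list)
for text in members:
    groups[group_key(text)].append(text)

for key, texts in groups.items():
    print(key, texts)
```

    On the toy data this gives three keyword groups ('student', 'staff' and 'Associate') rather than the two expected in the question, since word counts alone cannot tell that students and associates belong together.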