I have the following dataset:
import pandas as pd
data = pd.DataFrame({'Members': ['Biology PhD student', 'Chemistry Master student', 'Engineering undergraduate student', 'Administration staff',
'Reception staff', 'Research Associate Chemistry', 'Associate Statistics'], 'UCode': [1, 1, 1, 2, 2, 1, 1], 'id': ['aaa100', 'aaa121', 'aa123', 'bb212', 'bb214', 'aa111', 'aa109']})
data
                             Members  UCode      id
0                Biology PhD student      1  aaa100
1           Chemistry Master student      1  aaa121
2  Engineering undergraduate student      1   aa123
3               Administration staff      2   bb212
4                    Reception staff      2   bb214
5       Research Associate Chemistry      1   aa111
6               Associate Statistics      1   aa109
where the column data.Members contains strings describing the role of each listed member.
Which kind of text analysis would you suggest to find groups of similar members using only the text in the column data.Members? In this toy example, for instance, the analysis should return two distinct groups. I am thinking of a measure of similarity between two lists of strings/words.
Any suggestion/help is very much appreciated.
Thank you,
Marco
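A simple starting point is a plain word counter, for instance:
from collections import Counter

# Count how often each word appears across all member descriptions
word_counts = Counter()
for text in data['Members']:
    for word in text.split():
        word_counts[word] += 1
print(word_counts.most_common(3))
Output: [('student', 3), ('Chemistry', 2), ('staff', 2)]
Note that 'Associate' also occurs twice, so it ties with 'Chemistry' and 'staff'; most_common(3) only shows the first three.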
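To go from word counts to actual groups, one option (a rough sketch, not the only way to do it) is to compare the word sets of each pair of members with Jaccard similarity and put members in the same group whenever a chain of sufficiently similar pairs links them. The names `jaccard`, `group_by_similarity` and the `threshold` value of 0.1 are my own assumptions for illustration, and the member strings are copied into a plain list so the snippet runs without pandas.

```python
from itertools import combinations

members = [
    'Biology PhD student', 'Chemistry Master student',
    'Engineering undergraduate student', 'Administration staff',
    'Reception staff', 'Research Associate Chemistry', 'Associate Statistics',
]


def jaccard(a, b):
    """Jaccard similarity of two word sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_similarity(texts, threshold=0.1):
    """Return one group label per text.

    Two texts share a label if they are connected by a chain of pairs
    whose Jaccard similarity is at least `threshold` (an assumed cutoff).
    """
    tokens = [set(t.lower().split()) for t in texts]
    parent = list(range(len(texts)))  # union-find forest, one node per text

    def find(i):
        # Follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(texts)), 2):
        if jaccard(tokens[i], tokens[j]) >= threshold:
            parent[find(j)] = find(i)  # merge the two groups

    # Relabel roots as 0, 1, 2, ... in order of first appearance
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(texts))]


groups = group_by_similarity(members)
print(groups)  # [0, 0, 0, 1, 1, 0, 0]
```

On the toy data this yields two groups that happen to match UCode: the students and associates are chained together through 'student', 'chemistry' and 'associate', while the two staff rows are linked by 'staff'. On real data a generic word shared by everyone would merge everything into one group, so you would probably want to raise the threshold or drop very common words first.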