I have a table containing diagnosis information for a large group of patients. I would like to determine the most common groupings of those diagnoses: for example, is it "Bloaty Head Syndrome" and "Slack Tongue", or "Broken Wind", "Chronic Nosehair" and "Corrugated Ankles", or some other combination?
Data is structured like so:
import pandas as pd
import numpy as np
# List of ids
ids = ['id1', 'id2', 'id3','id4','id5']
# List of sample diagnoses
diagnosis = ["Broken Wind","Chronic Nosehair","Corrugated Ankles","Discrete Itching"]
# Create dataframe
df = pd.DataFrame({'id': ids})
# Randomly assign three distinct diagnoses to each id
df['diagnosis'] = df['id'].apply(lambda x: np.random.choice(diagnosis, 3, replace=False).tolist())
# Explode into separate rows
df = df.explode('diagnosis')
print(df)
For example, if both id2 and id5 contain "Broken Wind" and "Chronic Nosehair", that would be 2 of that combination. If id1, id3 and id4 contain "Chronic Nosehair", "Corrugated Ankles", and "Discrete Itching", that would be 3 of that combination. The goal is to determine which combination is most common.
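For reference, the naive plain-pandas approach I can think of only counts exact, whole-set matches, and misses partial overlaps such as two patients sharing just a pair of diagnoses. A rough sketch:
# Naive sketch: collapse each id's diagnoses into a frozenset,
# then tally patients whose full diagnosis sets match exactly
combos = df.groupby('id')['diagnosis'].apply(frozenset)
print(combos.value_counts())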
Is there an NLP library, such as NLTK, or some other method that can be used to process data stored like this in a pandas dataframe? Most of what I have found so far is geared toward sentiment analysis or analyzing single words rather than phrases...
I would offer that what you are trying to do here is not necessarily an NLP problem, but a much more general frequent pattern mining problem of the kind typically seen in market-basket analysis and recommender systems.
You can find the most frequent diagnosis combinations of any size with the fpgrowth algorithm from the mlxtend library, then look at the support for each diagnosis or combination of diagnoses:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth
# Create a list of diagnoses for each patient
x = df.groupby('id')['diagnosis'].apply(list)
# One-hot encode to a wide dataframe with a column for each diagnosis
te = TransactionEncoder()
te_ary = te.fit(x).transform(x)
te_df = pd.DataFrame(te_ary, columns=te.columns_)
# Calculate most frequent diagnosis co-occurrences
fp_df = fpgrowth(te_df, min_support=0.01, use_colnames=True)
# Sort by support and show the most frequent combinations first
print(fp_df.sort_values(by='support', ascending=False))
The resulting dataframe pairs each itemset with its support, the fraction of "transactions" (here, patients) in which that combination occurs:
| support | itemsets |
| ------- | -------------------------------------------------------- |
| 0.8 | {'Broken Wind'} |
| 0.6 | {'Corrugated Ankles'} |
| 0.6 | {'Chronic Nosehair'} |
| 0.6 | {'Discrete Itching'} |
| 0.6 | {'Corrugated Ankles', 'Broken Wind'} |
| 0.4 | {'Chronic Nosehair', 'Broken Wind'} |
| 0.4 | {'Discrete Itching', 'Chronic Nosehair'} |
| 0.4 | {'Discrete Itching', 'Broken Wind'} |
| 0.2 | {'Corrugated Ankles', 'Discrete Itching'} |
| 0.2 | {'Discrete Itching', 'Corrugated Ankles', 'Broken Wind'} |
| 0.2 | {'Corrugated Ankles', 'Chronic Nosehair'} |
| 0.2 | {'Chronic Nosehair', 'Discrete Itching', 'Broken Wind'} |
| 0.2 | {'Chronic Nosehair', 'Corrugated Ankles', 'Broken Wind'} |
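Since single-diagnosis itemsets tend to dominate the top of this table, a short follow-up sketch (reusing the fp_df dataframe from above) filters to combinations of at least two diagnoses before picking the most common one:
# Keep only itemsets with two or more diagnoses, since singletons
# are not "combinations" in the sense the question asks about
multi = fp_df[fp_df['itemsets'].apply(len) >= 2]
# The highest-support row is the most common combination
print(multi.sort_values(by='support', ascending=False).head(1))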