I'm trying to count the number of times a combination of strings appear in each row of a dataframe. Each ID uses a number of methods (some IDs use more methods than others) and I want to count the number of times any two methods have been combined together.
# df is from csv and has blank cells - I've used empty strings to demo here
df = pd.DataFrame({'id': ['101', '102', '103', '104'],
'method_1': ['HR', 'q-SUS', 'PEP', 'ET'],
'method_2': ['q-SUS', 'q-IEQ', 'AUC', 'EEG'],
'method_3': ['SC', '', 'HR', 'SC'],
'method_4': ['q-IEQ', '', 'ST', 'HR'],
'method_5': ['PEP', '', 'SC', '']})
print(df)
id method_1 method_2 method_3 method_4 method_5
0 101 HR q-SUS SC q-IEQ PEP
1 102 q-SUS q-IEQ
2 103 PEP AUC HR ST SC
3 104 ET EEG SC HR
I want to end up with a table that looks something like this:
Method A | Method B | Number of Times Combined |
---|---|---|
HR | SC | 3 |
HR | q-SUS | 1 |
HR | PEP | 2 |
q-IEQ | q-SUS | 2 |
EEG | ET | 1 |
EEG | SC | 1 |
etc. | etc. | etc. |
So far I've been trying variations of this code using itertools.combinations and collections Counter:
import numpy as np
import pandas as pd
import itertools
from collections import Counter
def get_all_combinations_without_nan(row):
# remove nan - this is for the blank csv cells
set_without_nan = {value for value in row if isinstance(value, str)}
# generate all combinations of values in row
all_combinations = []
for index, row in df.iterrows():
result = list(itertools.combinations(set_without_nan, 2))
all_combinations.extend(result)
return all_combinations
# get all possible combinations of values in a row
all_rows = df.apply(get_all_combinations_without_nan, 1).values
all_rows_flatten = list(itertools.chain.from_iterable(all_rows))
count_combinations = Counter(all_rows_flatten)
print(count_combinations)
It's doing something, but it seems to be counting multiple times or something (it's counting more combinations than are actually there. I've had a good look on Stack, but can't seem to solve this - everything seems really close though!
I hope someone can help - Thanks!
Use DataFrame.melt
for reshape with remove empty strings or missing values, then use DataFrame.merge
for all combinations, remove rows with same methods and count by GroupBy.size
:
df1 = df.melt('id', value_name='method_')
df1 = df1[(df1["method_"] != '') & (df1["method_"].notna())]
df = (df1.merge(df1, on='id', suffixes=('A','B'))
.query("method_A != method_B")
.groupby(['method_A','method_B'])
.size()
.reset_index(name='Number of Times Combined'))
print (df.head(20))
method_A method_B Number of Times Combined
0 AUC HR 1
1 AUC PEP 1
2 AUC SC 1
3 AUC ST 1
4 EEG ET 1
5 EEG HR 1
6 EEG SC 1
7 ET EEG 1
8 ET HR 1
9 ET SC 1
10 HR AUC 1
11 HR EEG 1
12 HR ET 1
13 HR PEP 2
14 HR SC 3
15 HR ST 1
16 HR q-IEQ 1
17 HR q-SUS 1
18 PEP AUC 1
19 PEP HR 2