I have a Pandas dataframe with a column called specialty
that looks like this:
0 1,5
1 1
2 1
3 1
4 1
5 1,5
6 3
7 3
8 1
9 1,3
10 1
11 1,2,4,6
I want to count the number of occurrences of each number. If there is more than one value for a certain cell, I would want it to be added to the count of 7
, but I'd also like each number in that cell to be added to its respective total. For example, for this data my desired output would look something like this:
1: 10
2: 1
3: 3
4: 1
5: 2
6: 1
7: 4
The 7
here is referring to the fact that there are 4 rows with multiple values. However, each of those numbers should be added to the overall count for each number.
Here's what I have so far, but it only counts multiples instead of also each number in the multiple. I'm also computing the percentage of each value.
# Convert specialties with more than 1 to item 7 (multiple)
part_chars['specialty'] = np.where(part_chars['specialty'].str.len() > 1, 7, part_chars['specialty'])
counts = part_chars.specialty.value_counts()
percs = part_chars.specialty.value_counts(normalize=True)*100
pd.concat([counts,percs], axis=1, keys=['count', 'percentage'])
And the output:
count percentage
7 213 40.804598
1 211 40.421456
3 39 7.471264
5 23 4.406130
6 13 2.490421
4 12 2.298851
2 11 2.107280
I'm not sure how to make the counts work the way I want them. Would appreciate any help with this.
You can split by the ,
and explode to get value counts of unique values. Then count the ,
rows and assign it to 7
out = df['specialty'].str.split(',').explode().value_counts()
out.loc[7] = df['specialty'].str.contains(',').sum()
# or in one line
out = pd.concat([df['specialty'].str.split(',').explode().value_counts(),
pd.Series([df['specialty'].str.contains(',').sum()], index=[7])])
print(out)
1 10
3 3
5 2
2 1
4 1
6 1
7 4
Name: specialty, dtype: int64