Search code examples
pythonpandasnumpywhere-clauseseries

Count number of occurrences in Pandas column with some multiples


I have a Pandas dataframe with a column called specialty that looks like this:

0         1,5
1           1
2           1
3           1
4           1
5         1,5
6           3
7           3
8           1
9         1,3
10          1
11    1,2,4,6

I want to count the number of occurrences of each number. If there is more than one value for a certain cell, I would want it to be added to the count of 7, but I'd also like each number in that cell to be added to its respective total. For example, for this data my desired output would look something like this:

1: 10
2: 1
3: 3
4: 1
5: 2
6: 1
7: 4

The 7 here is referring to the fact that there are 4 rows with multiple values. However, each of those numbers should be added to the overall count for each number.

Here's what I have so far, but it only counts multiples instead of also each number in the multiple. I'm also computing the percentage of each value.

# Convert specialties with more than 1 to item 7 (multiple)
part_chars['specialty'] = np.where(part_chars['specialty'].str.len() > 1, 7, part_chars['specialty'])

counts = part_chars.specialty.value_counts()
percs = part_chars.specialty.value_counts(normalize=True)*100
pd.concat([counts,percs], axis=1, keys=['count', 'percentage'])

And the output:

   count    percentage
7   213     40.804598
1   211     40.421456
3   39      7.471264
5   23      4.406130
6   13      2.490421
4   12      2.298851
2   11      2.107280

I'm not sure how to make the counts work the way I want them. Would appreciate any help with this.


Solution

  • You can split by the , and explode to get value counts of unique values. Then count the , rows and assign it to 7

    out = df['specialty'].str.split(',').explode().value_counts()
    out.loc[7] = df['specialty'].str.contains(',').sum()
    # or in one line
    out = pd.concat([df['specialty'].str.split(',').explode().value_counts(),
                     pd.Series([df['specialty'].str.contains(',').sum()], index=[7])])
    
    print(out)
    
    1    10
    3     3
    5     2
    2     1
    4     1
    6     1
    7     4
    Name: specialty, dtype: int64