Search code examples
pythonpandasdataframepandas-groupbyfrequency

Get the frequency of individual items in a list of each row of a column in a dataframe


Problem Statement

I have a pandas dataframe in which one of the column's values is of type list. I need to get the frequency of each item on that particular list.

For example:

import pandas as pd
data = [
    {
        "name": "fruits",
        "values": ["apple", "banana", "cherry", "apple", "mango", "banana", "apple"]
    },
    {
        "name": "cars",
        "values": ["Audi", "Ferrari", "Ferrari", "Audi", "honda", "Audi"]
    },
    {
        "name": "animals",
        "values": ["dogs", "cats", "tiger", "tiger", "cats", "cats", "camel"]
    }
]
df = pd.DataFrame(data)

Here if we print df, we will see the following dataframe.

    name    values
0   fruits  [apple, banana, cherry, apple, mango, banana, ...
1   cars    [Audi, Ferrari, Ferrari, Audi, honda, Audi]
2   animals [dogs, cats, tiger, tiger, cats, cats, camel]

Now, I would like to count the frequency of occurrence of each item in each row of values columns.

What I did

The solution I did is not efficient, but it was what I came up with so far. So trying to get the best possible way to solve it.

I took the help of python loops to count the frequency and again convert it back to the dataframe.

frequency_list = []
for idx, row in df.iterrows():
    frequency = [{ "name": row["name"], "value": x, "frequency": row["values"].count(x)} for x in list(set(row["values"]))]
    # sorting and getting top 5 frequency is optional
    frequency_list.append(sorted(frequency, key=lambda x: x["frequency"], reverse=True)[:5])

On printing frequency_list we will get.

[[{'name': 'fruits', 'value': 'apple', 'frequency': 3},
  {'name': 'fruits', 'value': 'banana', 'frequency': 2},
  {'name': 'fruits', 'value': 'cherry', 'frequency': 1},
  {'name': 'fruits', 'value': 'mango', 'frequency': 1}],
 [{'name': 'cars', 'value': 'Audi', 'frequency': 3},
  {'name': 'cars', 'value': 'Ferrari', 'frequency': 2},
  {'name': 'cars', 'value': 'honda', 'frequency': 1}],
 [{'name': 'animals', 'value': 'cats', 'frequency': 3},
  {'name': 'animals', 'value': 'tiger', 'frequency': 2},
  {'name': 'animals', 'value': 'camel', 'frequency': 1},
  {'name': 'animals', 'value': 'dogs', 'frequency': 1}]]

Now I started to create a dataframe for each item in frequency_list and concatenate them.

frequency_df = pd.DataFrame()
for each_frequency in frequency_list:
    temp_df = pd.DataFrame(each_frequency)
    if frequency_df.empty:
        frequency_df = temp_df
    else:
        frequency_df = pd.concat((frequency_df, temp_df), axis=0, ignore_index=True)

The data that frequency_df is holding looks like this:

    name    value   frequency
0   fruits  apple   3
1   fruits  banana  2
2   fruits  cherry  1
3   fruits  mango   1
4   cars    Audi    3
5   cars    Ferrari 2
6   cars    honda   1
7   animals cats    3
8   animals tiger   2
9   animals camel   1
10  animals dogs    1

EXPECTED OUTPUT

                frequency
name    value   
animals camel   1
        cats    3
        dogs    1
        tiger   2
cars    Audi    3
        Ferrari 2
        honda   1
fruits  apple   3
        banana  2
        cherry  1
        mango   1

Solution

  • Try:

    print(
        df.explode("values")
        .groupby(["name", "values"])
        .size()
        .to_frame(name="frequency")
    )
    

    Prints:

                     frequency
    name    values            
    animals camel            1
            cats             3
            dogs             1
            tiger            2
    cars    Audi             3
            Ferrari          2
            honda            1
    fruits  apple            3
            banana           2
            cherry           1
            mango            1