Search code examples
pythonpandasdefaultdict

How to sum up certain rows with Pandas and add the result to a defaultdict (large dataset)


I have a dataframe which consists out of 5 million name entries. The structure is as follows:

dataframe

What one can read from this dataframe is that for instance the name Mary was given to 14 babys in the state Alaska (AK) in the year 1910. But the name Mary were also given to newborns in the other states and the following years as well.

What I would like to identify is: What is the most given name in that particular dataset overall and how often was that name assigned?

I have tried this:

import pandas as pd
from collections import defaultdict

df = pd.read_csv("names.csv")

mask = df[["Name", "Count"]]

counter = 0
dd = defaultdict(int)

for pos, data in mask.iterrows():
    name = data["Name"]
    dd[name] = dd[name] + data["Count"]
    counter += 1

    if counter == 100000:
      break
print ("Done!")

freq_name = 0
name = ""

for key, value in dd.items():
    if freq_name < value:
        freq_name = value
        name = key

print(name)

This code works pretty well but only for up to 100.000 rows. However, when I use the presented code with the full dataset it takes ages.

Any idea or hint what I could improve would be greatly appreciated.


Solution

  • As suggested in the comments you can use something like this:

    df = pd.read_csv("names.csv")
    name, total_count = max(df.groupby('Name').Count.sum().items(), key=lambda x: x[1])