I have a dataframe which consists of 5 million name entries. The structure is as follows:
What one can read from this dataframe is that, for instance, the name Mary was given to 14 babies in the state of Alaska (AK) in the year 1910. The name Mary was also given to newborns in other states and in subsequent years.
What I would like to identify is: what is the most frequently given name in this dataset overall, and how often was that name assigned?
I have tried this:
import pandas as pd
from collections import defaultdict

df = pd.read_csv("names.csv")
mask = df[["Name", "Count"]]
counter = 0
dd = defaultdict(int)
for pos, data in mask.iterrows():
    name = data["Name"]
    dd[name] = dd[name] + data["Count"]
    counter += 1
    if counter == 100000:
        break
print("Done!")
freq_name = 0
name = ""
for key, value in dd.items():
    if freq_name < value:
        freq_name = value
        name = key
print(name)
This code works well enough, but only for up to 100,000 rows. When I run it on the full dataset, it takes ages.
Any idea or hint what I could improve would be greatly appreciated.
As suggested in the comments, you can use something like this:
df = pd.read_csv("names.csv")
name, total_count = max(df.groupby('Name').Count.sum().items(), key=lambda x: x[1])
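The `groupby` call does the per-name summing in vectorized pandas code instead of a Python-level `iterrows` loop, which is why it scales to millions of rows. An equivalent and slightly more idiomatic variant uses `idxmax`/`max` on the aggregated Series; here is a minimal sketch on a hypothetical miniature of the dataset (the real `names.csv` is not shown):

```python
import pandas as pd

# Hypothetical miniature of the dataset for illustration only
df = pd.DataFrame({
    "Name":  ["Mary", "Anna", "Mary", "John", "Mary"],
    "State": ["AK",   "AK",   "AL",   "AK",   "AL"],
    "Year":  [1910,   1910,   1910,   1911,   1911],
    "Count": [14,     12,     50,     8,      30],
})

# Sum the counts per name across all states and years
totals = df.groupby("Name")["Count"].sum()

name = totals.idxmax()      # name with the highest total
total_count = totals.max()  # that name's total count
print(name, total_count)    # Mary 94
```

Both versions compute the same result; `idxmax` just avoids materializing the `items()` iterator and the lambda.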