I'm relatively new to Python/panda. Here is my problem: I have a df looking like this:
df = pd.DataFrame({
'ZIP Code': ['1234','1234', '5678', '9101'],
'City Name': ['City A', 'City A', 'City B', 'City C'],
'Newspaper': ['City A News', 'City A Newspaper', 'News for City B', 'C News'],
})
As you can see, for City A, there are two Newspapers, but they are listed in two different rows. In my real df, I have multiple cases of cities, listed in separate rows, completely identical to each other, except for the listed newspaper.
So I want to create a df that looks somehow like this:
df_wanted = pd.DataFrame({
'ZIP Code': ['1234', '5678', '9101'],
'City Name': ['City A', 'City B', 'City C'],
'Newspaper': ['City A News, City A Newspaper', 'News for City B', 'C News'],
'Number of Newspapers': [2, 1, 0]
})
So basically, I want to remove the duplicate rows and add the string of the not- duplicate newspaper to the first entry for the city.
Additionally, I want to create a column that counts the number of newspapers so I can use the numbers for further analysis.
I hope you can help me with my problem! Thanks in advance :)
So far, I didnt find a solution to my problem.
You can use groupby named agg:
df.groupby('ZIP Code').agg(**{
'City Name': ('City Name', 'first'),
'Newspaper': ('Newspaper', ', '.join),
'No. Newspaper': ('Newspaper', 'size')
})
Output:
City Name Newspaper No. Newspaper
ZIP Code
1234 City A City A News, City A Newspaper 2
5678 City B News for City B 1
9101 City C C News 1