Search code examples
pandasdataframerowdrop-duplicates

Extract String from duplicate row, remove duplicate, give count of strings


I'm relatively new to Python/panda. Here is my problem: I have a df looking like this:

df = pd.DataFrame({
    'ZIP Code': ['1234','1234', '5678', '9101'],
    'City Name': ['City A', 'City A', 'City B', 'City C'],
    'Newspaper': ['City A News', 'City A Newspaper', 'News for City B', 'C News'],
})

As you can see, for City A, there are two Newspapers, but they are listed in two different rows. In my real df, I have multiple cases of cities, listed in separate rows, completely identical to each other, except for the listed newspaper.

So I want to create a df that looks somehow like this:

df_wanted = pd.DataFrame({
'ZIP Code': ['1234', '5678', '9101'],
'City Name': ['City A', 'City B', 'City C'],
'Newspaper': ['City A News, City A Newspaper', 'News for City B', 'C News'],
'Number of Newspapers': [2, 1, 0]

})

So basically, I want to remove the duplicate rows and add the string of the not- duplicate newspaper to the first entry for the city.

Additionally, I want to create a column that counts the number of newspapers so I can use the numbers for further analysis.

I hope you can help me with my problem! Thanks in advance :)

So far, I didnt find a solution to my problem.


Solution

  • You can use groupby named agg:

    df.groupby('ZIP Code').agg(**{
        'City Name': ('City Name', 'first'),
        'Newspaper': ('Newspaper', ', '.join),
        'No. Newspaper': ('Newspaper', 'size')
    })
    

    Output:

             City Name                      Newspaper  No. Newspaper
    ZIP Code                                                        
    1234        City A  City A News, City A Newspaper              2
    5678        City B                News for City B              1
    9101        City C                         C News              1