python pandas dataframe data-science data-processing

Pandas: Combine rows having same date different time into a single row of the same date(consolidate partial data of different time for same identity)

I have a sample dataframe as given below.

import pandas as pd
import numpy as np

NaN = np.nan
data = {'ID':['A', 'A', 'A', 'B','B','B'],
    'Date':['2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-20 04:38:26', '2021-09-01 
    00:12:29','2021-09-01 11:20:58','2021-09-02 09:20:58'],
    'Name':['xx',NaN,NaN,'yy',NaN,NaN],
    'Height':[174,NaN,NaN,160,NaN,NaN],
    'Weight':[74,NaN,NaN,58,NaN,NaN],
    'Gender':[NaN,'Male',NaN,NaN,'Female',NaN],
    'Interests':[NaN,NaN,'Hiking,Sports',NaN,NaN,'Singing']}

 df1 = pd.DataFrame(data)
 df1

I want to combine the data present on the same date into a single row. The 'Date' column is in timestamp format. The final output should look like the image shown below.

Any help is greatly appreciated. Thanks.

Solution

New Solution

The old solution was based on initial version of question where empty strings instead of NaN values were used for undefined values and all columns were of string types. With updated question using NaN for undefined values (and even when also updated to have different column data types of numeric and string types), the solution can be simplified as follows:

You can use .groupby() + GroupBy.last() to group by ID and date (without time) and then aggregate the NaN and non-NaN elements with the latest (asssuming column Date is presented in chronological order) non-NaN values for an ID, as follows:

# Convert `Date` to datetime format
df1['Date'] = pd.to_datetime(df1['Date'])

# Sort `df1` with ['ID', 'Date'] order if not already in this order
#df1 = df1.sort_values(['ID', 'Date'])

df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .last()
             .reset_index()
         ).replace([None], [np.nan])

Result:

print(df_out)


   ID       Date Name Height Weight  Gender      Interests
0  A 2021-09-20   xx  174.0   74.0    Male  Hiking,Sports
1  B 2021-09-01   yy  160.0   58.0  Female            NaN
2  B 2021-09-02  NaN    NaN    NaN     NaN        Singing

Old Solution

You can use .groupby() + .agg() to group by ID and date and then aggregate the NaN and non-NaN elements, as follows:

# Convert `Date` to datetime format
df1['Date'] = pd.to_datetime(df1['Date'])

df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .agg(lambda x: ''.join(x.dropna().astype(str)))
             .reset_index()
         ).replace('', np.nan)

Result:

print(df_out)


   ID       Date Name Height Weight  Gender      Interests
0  A 2021-09-20   xx  174.0   74.0    Male  Hiking,Sports
1  B 2021-09-01   yy  160.0   58.0  Female            NaN
2  B 2021-09-02  NaN    NaN    NaN     NaN        Singing

As your original question had all columns of string types, the above codes work fine to give results of all columns as string types. However, your edited question has data with both numeric and string types. In order to retain the original data types, we can modify the codes as follows:

# Convert `Date` to datetime format
df1['Date'] = pd.to_datetime(df1['Date'])

df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .agg(lambda x: np.nan if len(w:=x.dropna().reset_index(drop=True)) == 0 else w)
             .reset_index()
         )

Result:

print(df_out)


   ID       Date Name Height Weight  Gender      Interests
0  A 2021-09-20   xx  174.0   74.0    Male  Hiking,Sports
1  B 2021-09-01   yy  160.0   58.0  Female            NaN
2  B 2021-09-02  NaN    NaN    NaN     NaN        Singing


print(df_out.dtypes)

ID                   object
Date         datetime64[ns]
Name                 object
Height              float64            <==== retained as numeric dtype
Weight              float64            <==== retained as numeric dtype
Gender               object
Interests            object
dtype: object