Search code examples
pythonpandasdataframegroup-bypreprocessor

Using groupby for user id and combining strings


I'm having trouble with preprocessing data. I have data that looks like

I want to group by a field that denotes the user called Account Number, and I want to create a new field that is a concatenation of all values of Customer Event Type for each Account Number.

I've tried this:

df_by_accnum = df.groupby('Account Number')[['Customer Event Type']].agg(','.join).reset_index()

but it results in all the column names being concatenated instead of hte values (https://i.sstatic.net/Um0aU.png)

Could I get some help trouble shooting this? Thank you


Solution

  • By try and error I found out that it might be due the type of the column Customer Event Type. I guess it might hold non string values, which join assumes, and for some reason the column names are used instead.

    Try creating a new column converting Customer Event Type to strings and use that one:

    >>> d = {'Account Number': [1, 2, 3, 1], 'Customer Event Type': [1, 1, 2, 2]}
    >>> df = pd.DataFrame(data=d)
    >>> df['Customer Event Type str'] = df['Customer Event Type'].astype(str)
    >>> df.groupby('Account Number')[['Customer Event Type str']].agg(','.join).reset_index()
       Account Number Customer Event Type str
    0               1                     1,2
    1               2                       1
    2               3                       2
    

    Whereas using the numeric column produces:

    >>> df.groupby('Account Number')[['Customer Event Type']].agg(','.join).reset_index()
       Account Number                                Customer Event Type
    0               1  Account Number,Customer Event Type,Customer Ev...
    1               2  Account Number,Customer Event Type,Customer Ev...
    2               3  Account Number,Customer Event Type,Customer Ev...
    

    Hope it helps. Let us know if you find out the reason for this behavior. Thanks!