Search code examples
pythonpandasdataframenangroup

Pandas DataFrame and grouping Pandas Series data into individual columns by value


I am hoping someone can help me optimize the following Python/Pandas code. My code works, but I know there must be a cleaner and faster way to perform the operation under consideration.

I am looking for an optimized strategy because my use case will involve 16 unique ADC Types, as opposed to 4 in the example below. Also, my initial Pandas Series (i.e. ADC Type column), will be several 100,000 data points in length, rather than 8 in the example below.

import numpy as np
import pandas as pd
from enum import Enum

data_dict = {"RAW": [4000076160, 5354368, 4641792, 4289860736,
                     4136386944, 5440384, 4772864, 4289881216],
             "ADC_TYPE": [3, 7, 8, 9,
                          3, 7, 8, 9]}
df = pd.DataFrame(data_dict)
print(df)

The initial DataFrame (i.e. df) is:

          RAW  ADC_TYPE
0  4000076160         3
1     5354368         7
2     4641792         8
3  4289860736         9
4  4136386944         3
5     5440384         7
6     4772864         8
7  4289881216         9

I then manipulate the DataFrame above using the following code:

unique_types = df["ADC_TYPE"].unique()

dict_concat = {"RAW": [],
               "ADC_TYPE_3": [],
               "ADC_TYPE_7": [],
               "ADC_TYPE_8": [],
               "ADC_TYPE_9": []}
df_concat = pd.DataFrame(dict_concat)

for adc_type in unique_types:
    df_group = df.groupby(["ADC_TYPE"]).get_group(adc_type).rename(columns={"ADC_TYPE": f"ADC_TYPE_{adc_type}"})
    df_concat = pd.concat([df_concat, df_group])

print(df_concat.sort_index())

The returned DataFrame (i.e. df_concat) is displayed below. The ordering of RAW and the associated ADC Type values must remain unchanged. I need the return DataFrame to look just like the DataFrame below.

            RAW  ADC_TYPE_3  ADC_TYPE_7  ADC_TYPE_8  ADC_TYPE_9
0  4.000076e+09         3.0         NaN         NaN         NaN
1  5.354368e+06         NaN         7.0         NaN         NaN
2  4.641792e+06         NaN         NaN         8.0         NaN
3  4.289861e+09         NaN         NaN         NaN         9.0
4  4.136387e+09         3.0         NaN         NaN         NaN
5  5.440384e+06         NaN         7.0         NaN         NaN
6  4.772864e+06         NaN         NaN         8.0         NaN
7  4.289881e+09         NaN         NaN         NaN         9.0

Solution

  • I liked the idea of using get_dummies, so I modified it a bit:

    df = (pd.get_dummies(df, 'ADC_TYPE', '_', columns=['ADC_TYPE'])
            .replace(1, np.nan)
            .apply(lambda x: x.fillna(df['ADC_TYPE']))
            .replace(0, np.nan))
    

    Output:

              RAW  ADC_TYPE_3  ADC_TYPE_7  ADC_TYPE_8  ADC_TYPE_9
    0  4000076160         3.0         NaN         NaN         NaN
    1     5354368         NaN         7.0         NaN         NaN
    2     4641792         NaN         NaN         8.0         NaN
    3  4289860736         NaN         NaN         NaN         9.0
    4  4136386944         3.0         NaN         NaN         NaN
    5     5440384         NaN         7.0         NaN         NaN
    6     4772864         NaN         NaN         8.0         NaN
    7  4289881216         NaN         NaN         NaN         9.0