I have to create a Pandas dataframe from a csv file using a pipeline.
The source csv file may contain any number of columns whose header contains the string 'SLA'. Sample data below:
While creating the pandas pipeline, I have to extract and store only the string before the first delimiter ('|') for all the SLA columns. For example, for ID=1 the SLA1 column in the csv contains the value '24h|0h|13h', and I have to store only '24h' in the dataframe (similarly for the other SLA columns).
My code is as follows:
import pandas as pd

def get_sla_cols(df):
    return [col for col in df.columns if 'SLA' in col]

def split(df, cols, split_str):
    for col in cols:
        df[col] = df[col].str.split(split_str, expand=True, n=1)[0]
    return df

csv_path = r"C:\Users\daryl\Downloads\svc.csv"
svc_df = (pd.read_csv(csv_path)
          .pipe(split, lambda x: x.pipe(get_sla_cols), '|'))
But if I run:
print(pd.read_csv(csv_path).pipe(lambda x: x.pipe(get_sla_cols)))
I'm getting the below output as expected:
Since the code lambda x: x.pipe(get_sla_cols) generates the list of column names, why does the function split(df, cols, split_str) throw an error saying it cannot iterate over the list of columns in the for loop? (Refer to the error screenshot.)
Note: If I replace lambda x: x.pipe(get_sla_cols) with a hardcoded list, say ['SLA1', 'SLA2', 'SLA3', 'SLA4', 'SLA5'], the code (the split() function) throws no error and works as expected.
This should work:
svc_df = (pd.read_csv(csv_path)
          .pipe(lambda df: split(df, get_sla_cols(df), '|')))
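Use a lambda function for the whole pipe step. DataFrame.pipe passes its extra arguments to the function unchanged, so in the original code split receives the lambda itself as cols, not the list that the lambda would return, and a function object cannot be iterated over.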
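Here is a minimal, self-contained sketch of the difference. It assumes a small hypothetical CSV held in memory instead of svc.csv; the csv_text sample and the error_message variable are illustrative and not from the question. The first call reproduces the original pipeline and captures the resulting TypeError. The second wraps the whole step in a lambda so that get_sla_cols is actually called on the dataframe.

```python
import io

import pandas as pd


def get_sla_cols(df):
    # Names of all columns whose header contains 'SLA'
    return [col for col in df.columns if 'SLA' in col]


def split(df, cols, split_str):
    # Keep only the text before the first delimiter in each given column
    for col in cols:
        df[col] = df[col].str.split(split_str, expand=True, n=1)[0]
    return df


# Hypothetical sample data standing in for svc.csv
csv_text = "ID,SLA1,SLA2\n1,24h|0h|13h,4h|1h\n2,48h|2h|5h,8h|3h\n"

# Original pipeline: pipe calls split(df, <lambda>, '|'), so `cols` is the
# lambda object itself and the for loop cannot iterate over it.
try:
    pd.read_csv(io.StringIO(csv_text)).pipe(
        split, lambda x: x.pipe(get_sla_cols), '|')
except TypeError as exc:
    error_message = str(exc)

# Fixed pipeline: the lambda receives the dataframe, evaluates
# get_sla_cols(df) to a real list, and only then calls split.
svc_df = (pd.read_csv(io.StringIO(csv_text))
          .pipe(lambda df: split(df, get_sla_cols(df), '|')))

print(error_message)
print(svc_df)
```

The print call in the question worked because there the lambda was the function being piped, so pipe called it with the dataframe. In the failing pipeline the same lambda was only an extra argument, which pipe never calls.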