I'm working with a dataframe that has several columns containing distinct elements that I would like to parse into lists. These items are separated by double pipes (||).
I can currently write single lines of code using the assign function that can separate elements within a single column, but I am having trouble writing a function that could take a list of columns and perform the same function across all of them.
For example, imagine my dataframe is called DATA:
import pandas as pd
d1 = {'Column_1': ["SF||NYG","SF||NYG","SF||NYG",
"SF||NYG","SF||NYG","SF||NYG","SF||NYG"],
'Column_2': ["SF||NYG","SF||NYG","SF||NYG",
"SF||NYG","SF||NYG","SF||NYG","SF||NYG"]}
DATA = pd.DataFrame(d1)
print (DATA)
Column_1 Column_2
0 SF||NYG SF||NYG
1 SF||NYG SF||NYG
2 SF||NYG SF||NYG
3 SF||NYG SF||NYG
4 SF||NYG SF||NYG
5 SF||NYG SF||NYG
6 SF||NYG SF||NYG
If I want to separate out items in Column_1 into a list, I can write code that achieves my objective like this:
DATA = DATA.assign(Column_1=DATA["Column_1"].str.split('\|\|'))
print(DATA)
Column_1 Column_2
0 [SF, NYG] SF||NYG
1 [SF, NYG] SF||NYG
2 [SF, NYG] SF||NYG
3 [SF, NYG] SF||NYG
4 [SF, NYG] SF||NYG
5 [SF, NYG] SF||NYG
6 [SF, NYG] SF||NYG
However, if I attempt to write an abstracted function of this:
def sep_list(df, col_list, sep):
for col in col_list:
df = df.assign(col=df[col].str.split(sep))
The function will run, but the data within columns won't be separated out as I expect. Essentially, the data will look exactly the same as when it went into the function. I feel like it has something to do with the "col=df[col]" part, as the first col in the working line of code above is not surrounded by quotes, and it seems the code from the function would run the first col as being in quotes (e.g., "Column_1"=DATA["Column_1"] instead of Column_1=DATA["Column_1"].
I'm not sure how to re-write my function to do the equivalent of the single code line on a list of columns that are passed into the function. I've tried creating a separate variable within the function that would strip quotes from the col name, but this did not seem to help.
Any help would be greatly appreciated.
You can use a lambda function and apply function to do this:
DATA.apply(lambda x: x.str.split('\|\|'))
Out[33]:
Column_1 Column_2
0 [SF, NYG] [SF, NYG]
1 [SF, NYG] [SF, NYG]
2 [SF, NYG] [SF, NYG]
3 [SF, NYG] [SF, NYG]
4 [SF, NYG] [SF, NYG]
5 [SF, NYG] [SF, NYG]
6 [SF, NYG] [SF, NYG]