I want to write a script/function to process a fairly generic pandas dataframe. The data frame is the result of a pd.merge()
operation between two data frames, one of which is supplied by the user in the form of a CSV file with arbitrary columns. I know all the values are text but that's about it.
I want to create a new column in the dataframe based on the values of a combination of other columns. This is a fairly easy problem when the column names are known, but in this case I know neither the names of the columns nor how they are to be combined in advance.
Suppose the dataframe df
looks like this:
print(df)
col1 col2
row1 abc def
row2 ghi jkl
And suppose also that the desired manipulation is "create a new column by adding suffix _x
to the values in col2
". It would be nice if the use could express a template for this manipulation using f-strings. In this case, the template might be ..
template = 'f"{col2}_x"'
The double-quoting is is intentional -- I want to delay the evaluation of the template until it is applied to the dataframe. Note that the user will know the names of at least some columns (the ones that they supplied via the CSV file) and so they can specify a template based on these column names.
I was hoping that I could simply use pd.eval()
or more specifically df.eval()
which evaluates an expression using the namespace provided by the column names.
Something like
df["new_col"] = df.eval(template)
But this returns ...
AttributeError: 'PythonExprVisitor' object has no attribute 'visit_JoinedStr'
.. and I gather that pd.eval()
does not support f-strings as of March 2023: https://github.com/pandas-dev/pandas/issues/52310
I guess I could export the dataframe to a dictionary and then iterate over the rows, but I was really hoping for a neat, pandas-esque solution.
Update:
I've also been trying to achieve something simular with pd.DataFrame.apply()
In the concrete case, I can iterate over the dataframe as follows:
ser=pd.Series()
for index,row in df.iterrows():
locals().update(row.to_dict())
ser[index]=eval(template)
df["new_col"]=ser
This might have to do, but locals().update()
feels like a hack and all the advice I have read recommends vectorising over iterating. I'm also having a tough time wrapping this in a function (named or lambda
) to pass into pd.DataFrame.apply()
because again I don't know the column names ahead of time.
Here is one way to do it with Python built-in function eval, which must be used with caution:
import pandas as pd
df = pd.DataFrame({"col1": ["abc", "ghi"], "col2": ["def", "jkl"]})
template = 'df[f"{name}"]+"_x"'
name = "col2"
df["new_col"] = eval(template)
Then:
print(df)
# Output
col1 col2 new_col
0 abc def def_x
1 ghi jkl jkl_x