I am looking to retrieve the name of an instance of DataFrame, that I pass as an argument to my function, to be able to use this name in the execution of the function. Example in a script:
display(df_on_step_42)
I would like to retrieve the string "df_on_step_42" to use in the execution of the display function (that display the content of the DataFrame).
As a last resort, I can pass as argument of DataFrame and its name:
display(df_on_step_42, "df_on_step_42")
But I would prefer to do without this second argument.
PySpark DataFrames are non-transformable, so in our data pipeline, we cannot systematically put a name attribute to all the new DataFrames that come from other DataFrames.
I finally found a solution to my problem using the inspect and re libraries.
I use the following lines which correspond to the use of the display() function
import inspect
import again
def display(df):
frame = inspect.getouterframes(inspect.currentframe())[1]
name = re.match("\s*(\S*).display", frame.code_context[0])[1]
print(name)
display(df_on_step_42)
The inspect library allows me to get the call context of the function, in this context, the code_context attribute gives me the text of the line where the function is called, and finally the regex library allows me to isolate the name of the dataframe given as parameter.
It’s not optimal but it works.