I want to extract a substring from a file name every time a new file arrives for processing, and load that value into the file. Suppose we receive many files from company X for a cleansing process, and the first thing we need to do is take a substring from the file name.
For example, the file name is 'RV_NETWORK_AXN TECHNOLOGY_7737463273272635'. From this I want to take 'AXN TECHNOLOGY', create a new column named 'COMPANY NAME' in the same file, and load the value 'AXN TECHNOLOGY' into it. The file names change, but the company name always comes after the second underscore.
In the comment, you said that using df_1 = df_1.withColumn('COMPANY', F.split(F.input_file_name(), '_')[3]) extracts 'AXN TECHNOLOGY.csv'.
I'd suggest two options:
You could use one more split, this time on '\.', and then use element_at to get the second-to-last element. Splitting on '\.' works and splitting on '.' doesn't, because this argument of the split function is not a plain string but a regex pattern, and an unescaped dot in a regex means "any character".
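from pyspark.sql import functions as F

# Split on '_', take index 3, split that on a literal dot, keep the 2nd-to-last piece.
df = df.withColumn(
    'COMPANY',
    F.element_at(F.split(F.split(F.input_file_name(), '_')[3], r'\.'), -2)
)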
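To see why the escape matters without starting a Spark session, here is a minimal sketch that uses Python's re module as a stand-in for Spark's split. Spark uses Java regular expressions rather than Python's, but an unescaped dot means "any character" in both. The sample value and the helper name company_from_token are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical value: what the first split on '_' leaves at index 3.
token = "AXN TECHNOLOGY.csv"

# Unescaped dot: every character acts as a delimiter, so only empty strings remain.
unescaped = re.split(".", token)

# Escaped dot: only the literal '.' acts as a delimiter.
escaped = re.split(r"\.", token)


def company_from_token(value: str) -> str:
    # Plain-Python analogue of taking the second-to-last piece after the split.
    return re.split(r"\.", value)[-2]


print(unescaped)
print(escaped)
print(company_from_token(token))
```

In this plain-Python version a value with no dot at all raises IndexError, since there is no second-to-last piece to return.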
The following regex pattern extracts only what lies between the 3rd _ and a potential 4th _, excluding the file extension (e.g. .csv).
df = df.withColumn(
    'COMPANY',
    F.regexp_extract(F.input_file_name(), r'^.+?_.*?_.*?_([^_]+)\w*\.\w+$', 1)
)
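The pattern can be tried outside Spark as well. The sketch below runs it through Python's re module, which accepts this particular pattern with the same meaning. The helper extract_company and the sample paths are hypothetical; real values returned by input_file_name() will look different, so treat this as a way to experiment rather than a definitive check.

```python
import re

# Same pattern as in the regexp_extract call above.
PATTERN = re.compile(r'^.+?_.*?_.*?_([^_]+)\w*\.\w+$')


def extract_company(path: str) -> str:
    # Return capture group 1, or '' when nothing matches
    # (regexp_extract also yields an empty string on no match).
    match = PATTERN.search(path)
    return match.group(1) if match else ""


# Hypothetical paths: three underscores precede the company name.
with_id = "/data/in_box/RV_NETWORK_AXN TECHNOLOGY_7737463273272635.csv"
without_id = "/data/in_box/RV_NETWORK_AXN TECHNOLOGY.csv"

# Hypothetical bare file name: only two underscores precede the company name.
bare_name = "RV_NETWORK_AXN TECHNOLOGY_7737463273272635.csv"

for sample in (with_id, without_id, bare_name):
    print(repr(extract_company(sample)))
```

The first two samples give 'AXN TECHNOLOGY'. The third gives the numeric ID instead, because the pattern counts underscores from the start of the whole string. If the full path contributes no underscore of its own, the pattern would need one fewer `.*?_` group.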