file apache-spark pyspark substring filenames

How to pick the substring from a filename and load that substring into a table column in PySpark?

We are reading files from ADLS through this kind of command:

relative_path = "ContainerName/"
input_file_name = "ss_old_data_Amazon_Blast_202211093837474.csv"

How can we fetch a substring from a file name that we are getting for processing.

For example:
1st filename: vm_Path_Accenture_Complex_Union_202211027373.csv
2nd filename: vm_path_Google_is_a_good_company_20221109473.csv
3rd filename: ss_old_data_Amazon_Blast_202211093837474.csv
4th filename: ss_old_data_Black_Adam_(TY)_2022847093837474.csv
5th filename: ss_old_data_Man_of_steel_(PQ)_2022847093837474.csv

We need to pick specific substring from the filename and remove _ from substring and make it upper case. I don't need to pick the date format (numbers) or .csv - just the company name. vm_path will be there at the beginning for most of the files, and sometimes ss_old_data. We need to remove this part too.

Expected output:
1st filename should become: ACCENTURECOMPLEXUNION
2nd filename should become: GOOGLEISAGOODCOMPANY
3rd filename should become: AMAZONBLAST
4th filename should become : BLACKADAM
5th filename should become: MANOFSTEEL

How can we achieve it in PySpark?

Solution

Since you store your file name in a variable, you can reuse it and save it as a literal (lit). But before saving you will have to do some modifications to the literal.

ss_old_data_Amazon_Blast_202211093837474.csv <- input
_Amazon_Blast <- after applying regexp_extract with pattern '(?i)(vm_path|ss_old_data)(.+)_'
AmazonBlast <- after applying translate(..., '_', '')
AMAZONBLAST <- after applying upper

from pyspark.sql import functions as F

input_file_name ="ss_old_data_Amazon_Blast_202211093837474.csv"

df = spark.read.csv(input_file_name)
df = df.withColumn('file_name', F.lit(input_file_name))

c = F.regexp_extract('file_name', '(?i)(vm_path|ss_old_data)([^(]+)(\(.+)?_', 2)
df = df.withColumn('file_name', F.upper(F.translate(c, '_', '')))
                   
df.show()
# +----------+-----------+
# |       _c0|  file_name|
# +----------+-----------+
# |csv_line_1|AMAZONBLAST|
# |csv_line_2|AMAZONBLAST|
# +----------+-----------+