Search code examples
fileapache-sparkpysparksubstringfilenames

How to pick the substring from a filename and load that substring into a table column in PySpark?


We are reading files from ADLS through this kind of command:

relative_path = "ContainerName/"
input_file_name = "ss_old_data_Amazon_Blast_202211093837474.csv"

How can we fetch a substring from a file name that we are getting for processing.

For example:
1st filename: vm_Path_Accenture_Complex_Union_202211027373.csv
2nd filename: vm_path_Google_is_a_good_company_20221109473.csv
3rd filename: ss_old_data_Amazon_Blast_202211093837474.csv
4th filename: ss_old_data_Black_Adam_(TY)_2022847093837474.csv
5th filename: ss_old_data_Man_of_steel_(PQ)_2022847093837474.csv

We need to pick specific substring from the filename and remove _ from substring and make it upper case. I don't need to pick the date format (numbers) or .csv - just the company name. vm_path will be there at the beginning for most of the files, and sometimes ss_old_data. We need to remove this part too.

Expected output:
1st filename should become: ACCENTURECOMPLEXUNION
2nd filename should become: GOOGLEISAGOODCOMPANY
3rd filename should become: AMAZONBLAST
4th filename should become : BLACKADAM
5th filename should become: MANOFSTEEL

How can we achieve it in PySpark?


Solution

  • Since you store your file name in a variable, you can reuse it and save it as a literal (lit). But before saving you will have to do some modifications to the literal.

    ss_old_data_Amazon_Blast_202211093837474.csv <- input
    _Amazon_Blast <- after applying regexp_extract with pattern '(?i)(vm_path|ss_old_data)(.+)_'
    AmazonBlast <- after applying translate(..., '_', '')
    AMAZONBLAST <- after applying upper

    from pyspark.sql import functions as F
    
    input_file_name ="ss_old_data_Amazon_Blast_202211093837474.csv"
    
    df = spark.read.csv(input_file_name)
    df = df.withColumn('file_name', F.lit(input_file_name))
    
    c = F.regexp_extract('file_name', '(?i)(vm_path|ss_old_data)([^(]+)(\(.+)?_', 2)
    df = df.withColumn('file_name', F.upper(F.translate(c, '_', '')))
                       
    df.show()
    # +----------+-----------+
    # |       _c0|  file_name|
    # +----------+-----------+
    # |csv_line_1|AMAZONBLAST|
    # |csv_line_2|AMAZONBLAST|
    # +----------+-----------+