Tags: python, apache-spark, pyspark, hdfs

Split file name into different columns of pyspark dataframe


I am using the PySpark SQL function input_file_name to add the input file name as a dataframe column:

from pyspark.sql.functions import input_file_name

df = df.withColumn("filename", input_file_name())

The column now has values like "abc://dev/folder1/date=20200813/id=1".

From this column I have to create 2 different columns:

  1. Date
  2. ID

I have to extract only the date and id values from the file name and populate the two columns mentioned above.

I could split the path on "/" and pick the values by position, but if the folder structure changes that would break.

Is there a way to check whether the file name contains the strings "date" and "id", take the values after the equals sign, and populate two new columns with them?

Below is the expected output.

filename                             date     id
abc://dev/folder1/date=20200813/id=1 20200813 1

Solution

  • You could use regexp_extract with lookbehind patterns anchored to the date= and id= substrings:

    from pyspark.sql import Row
    import pyspark.sql.functions as f

    df = sc.parallelize(['abc://dev/folder1/date=20200813/id=1',
                         'def://dev/folder25/id=3/date=20200814'])\
           .map(lambda l: Row(file=l)).toDF()
    df.show(truncate=False)
    
    +-------------------------------------+
    |file                                 |
    +-------------------------------------+
    |abc://dev/folder1/date=20200813/id=1 |
    |def://dev/folder25/id=3/date=20200814|
    +-------------------------------------+
    
    df = df.withColumn('date', f.regexp_extract(f.col('file'), '(?<=date=)[0-9]+', 0))\
           .withColumn('id', f.regexp_extract(f.col('file'), '(?<=id=)[0-9]+', 0))
    df.show(truncate=False)
    

    Which outputs:

    +-------------------------------------+--------+---+
    |file                                 |date    |id |
    +-------------------------------------+--------+---+
    |abc://dev/folder1/date=20200813/id=1 |20200813|1  |
    |def://dev/folder25/id=3/date=20200814|20200814|3  |
    +-------------------------------------+--------+---+
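
    Spark's regexp_extract uses Java regular expressions, but the lookbehind syntax is the same in Python's re module, so the patterns can be sanity-checked locally before wiring them into Spark. The extract_part helper below is just an illustrative stand-in for regexp_extract (including its empty-string default when nothing matches), not part of PySpark:

    import re

    # Same lookbehind patterns as passed to regexp_extract above.
    DATE_PATTERN = r'(?<=date=)[0-9]+'
    ID_PATTERN = r'(?<=id=)[0-9]+'

    def extract_part(path, pattern):
        """Return the first match of pattern in path, or '' if absent
        (mirroring regexp_extract's empty-string behaviour)."""
        m = re.search(pattern, path)
        return m.group(0) if m else ''

    paths = ['abc://dev/folder1/date=20200813/id=1',
             'def://dev/folder25/id=3/date=20200814']
    for p in paths:
        print(extract_part(p, DATE_PATTERN), extract_part(p, ID_PATTERN))
    # 20200813 1
    # 20200814 3

    Because the patterns key off the date= and id= markers rather than the position in the path, reordered or extra folders do not affect the result.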