python regex apache-spark pyspark apache-spark-sql

split fields in pyspark dataframes

I have a dataframe in pyspark with a filed value as below -

+----------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                       |
+----------------------------------------------------------------------------------------------------------------------------+
|2023-03-05 MONTH   2020M03 2020-03-01  2020-03-31  ADDR    Fargo-Wahpeton, ND-MN   61.30   98.52   433.23                   |
|2023-03-05 MONTH   2020M03 2020-03-01  2020-03-31  STATE   TX  43.38   74.61   380.82                                       |
|2023-03-05 MONTH   2020M03 2020-03-01  2020-03-31  ADDR    Kalamazoo-Battle Creek-Portage, MI  30.19   49.06   266.33       |
|2023-03-05 MONTH   2020M03 2020-03-01  2020-03-31  STATE   TN  11.87   19.92   946.73                                       |
|2023-03-05 MONTH   2020M03 2020-03-01  2020-03-31  ADDR    New York-Newark, NY-NJ-CT-PA    32.24   55.95   322.16           |
|2023-03-05 MONTH   2020M03 2020-03-01  2020-03-31  LOCATION    New England 22.277  42.56   202.76                           |
+----------------------------------------------------------------------------------------------------------------------------+

Now, I would like to split the field value into several fields as below -

+-----------+----------+-----------+--------------+-------------+-------------------+---------------------------------------+---------+--------------+------------------+
|insert_dt  |sub_type  |ret_month  |month_start   |month_end    |area_class         |area_details                           |sub_paid |sub_pending   |sub_annual_amt    |
+-----------+----------+-----------+--------------+-------------+-------------------+---------------------------------------+---------+--------------+------------------+
|2023-03-05 |MONTH     |2020M03    |2020-03-01    |2020-03-31   |addr               |Fargo-Wahpeton, ND-MN                  |61.30    |98.52         |433.23            |
|2023-03-05 |MONTH     |2020M03    |2020-03-01    |2020-03-31   |state              |TX                                     |43.38    |74.61         |380.82            |
|2023-03-05 |MONTH     |2020M03    |2020-03-01    |2020-03-31   |addr               |Kalamazoo-Battle Creek-Portage, MI     |30.19    |49.06         |266.33            |
|2023-03-05 |MONTH     |2020M03    |2020-03-01    |2020-03-31   |state              |TN                                     |11.87    |19.92         |946.73            |
|2023-03-05 |MONTH     |2020M03    |2020-03-01    |2020-03-31   |addr               |New York-Newark, NY-NJ-CT-PA           |32.24    |55.95         |322.16            |
|2023-03-05 |MONTH     |2020M03    |2020-03-01    |2020-03-31   |location           |New England                            |22.27    |42.56         |202.76            |
+-----------+----------|-----------+--------------+-------------+-------------------+---------------------------------------+---------+--------------+------------------+

I tried regex_extract as below but could get this.

regex_pattern = r'^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)$'

df.select(regexp_extract('value', regex_pattern, 1).alias('insert_dt'),
          regexp_extract('value', regex_pattern, 2).alias('sub_type'),
          regexp_extract('value', regex_pattern, 3).alias('ret_month'),
          regexp_extract('value', regex_pattern, 4).alias('month_start'),
         regexp_extract('value', regex_pattern, 5).alias('month_end'),
          regexp_extract('value', regex_pattern, 6).alias('area_class'),
          regexp_extract('value', regex_pattern, 7).alias('area_details'),
          regexp_extract('value', regex_pattern, 8).alias('sub_paid'),
          regexp_extract('value', regex_pattern, 9).alias('sub_pending'),
          regexp_extract('value', regex_pattern, 10).alias('sub_annual_amt')
          ).display()

could someone suggest me a way to achieve this.

Thanks in advance.

Solution

Try out this regex

regex_pattern = r'^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+([^0-9]+)\s+(\S+)\s+(\S+)?\s+?(\S+)\s+$'

This should group the area_details in the same column