I have a dataframe in PySpark with a field `value`
as below -
+----------------------------------------------------------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------------------------------------------------------+
|2023-03-05 MONTH 2020M03 2020-03-01 2020-03-31 ADDR Fargo-Wahpeton, ND-MN 61.30 98.52 433.23 |
|2023-03-05 MONTH 2020M03 2020-03-01 2020-03-31 STATE TX 43.38 74.61 380.82 |
|2023-03-05 MONTH 2020M03 2020-03-01 2020-03-31 ADDR Kalamazoo-Battle Creek-Portage, MI 30.19 49.06 266.33 |
|2023-03-05 MONTH 2020M03 2020-03-01 2020-03-31 STATE TN 11.87 19.92 946.73 |
|2023-03-05 MONTH 2020M03 2020-03-01 2020-03-31 ADDR New York-Newark, NY-NJ-CT-PA 32.24 55.95 322.16 |
|2023-03-05 MONTH 2020M03 2020-03-01 2020-03-31 LOCATION New England 22.277 42.56 202.76 |
+----------------------------------------------------------------------------------------------------------------------------+
Now, I would like to split the field value
into several fields as below -
+-----------+----------+-----------+--------------+-------------+-------------------+---------------------------------------+---------+--------------+------------------+
|insert_dt |sub_type |ret_month |month_start |month_end |area_class |area_details |sub_paid |sub_pending |sub_annual_amt |
+-----------+----------+-----------+--------------+-------------+-------------------+---------------------------------------+---------+--------------+------------------+
|2023-03-05 |MONTH |2020M03 |2020-03-01 |2020-03-31 |addr |Fargo-Wahpeton, ND-MN |61.30 |98.52 |433.23 |
|2023-03-05 |MONTH |2020M03 |2020-03-01 |2020-03-31 |state |TX |43.38 |74.61 |380.82 |
|2023-03-05 |MONTH |2020M03 |2020-03-01 |2020-03-31 |addr |Kalamazoo-Battle Creek-Portage, MI |30.19 |49.06 |266.33 |
|2023-03-05 |MONTH |2020M03 |2020-03-01 |2020-03-31 |state |TN |11.87 |19.92 |946.73 |
|2023-03-05 |MONTH |2020M03 |2020-03-01 |2020-03-31 |addr |New York-Newark, NY-NJ-CT-PA |32.24 |55.95 |322.16 |
|2023-03-05 |MONTH |2020M03 |2020-03-01 |2020-03-31 |location |New England |22.27 |42.56 |202.76 |
+-----------+----------+-----------+--------------+-------------+-------------------+---------------------------------------+---------+--------------+------------------+
I tried `regexp_extract` as below but could not get it to work.
from pyspark.sql.functions import regexp_extract

regex_pattern = r'^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)$'

df.select(regexp_extract('value', regex_pattern, 1).alias('insert_dt'),
          regexp_extract('value', regex_pattern, 2).alias('sub_type'),
          regexp_extract('value', regex_pattern, 3).alias('ret_month'),
          regexp_extract('value', regex_pattern, 4).alias('month_start'),
          regexp_extract('value', regex_pattern, 5).alias('month_end'),
          regexp_extract('value', regex_pattern, 6).alias('area_class'),
          regexp_extract('value', regex_pattern, 7).alias('area_details'),
          regexp_extract('value', regex_pattern, 8).alias('sub_paid'),
          regexp_extract('value', regex_pattern, 9).alias('sub_pending'),
          regexp_extract('value', regex_pattern, 10).alias('sub_annual_amt')
          ).display()
Could someone suggest a way to achieve this?
Thanks in advance.
Try this regex:
regex_pattern = r'^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+?)\s+(\S+)\s+(\S+)\s+(\S+)\s*$'
The lazy `(.+?)` in group 7 keeps the whole multi-word area name
in the area_details column, while the three trailing `(\S+)` groups
capture the numeric fields. Your original pattern failed because the
greedy `(.+)` swallowed text up to the last tokens, and it declared
15 groups while only 10 exist in each row.
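You can sanity-check a pattern like this outside Spark with Python's `re` module before plugging it into `regexp_extract` (Spark compiles the pattern with Java regex, but the lazy-quantifier behaviour is the same here). This sketch uses a lazy `(.+?)` variant for group 7:

```python
import re

# Lazy (.+?) for area_details; three trailing (\S+) groups for the
# numeric fields; \s*$ tolerates optional trailing padding.
regex_pattern = r'^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+?)\s+(\S+)\s+(\S+)\s+(\S+)\s*$'

rows = [
    "2023-03-05 MONTH 2020M03 2020-03-01 2020-03-31 ADDR Fargo-Wahpeton, ND-MN 61.30 98.52 433.23",
    "2023-03-05 MONTH 2020M03 2020-03-01 2020-03-31 STATE TX 43.38 74.61 380.82",
]

for row in rows:
    m = re.match(regex_pattern, row)
    # group 7 is the multi-word area name; 8-10 are the amounts
    print(m.group(7), "|", m.group(8), m.group(9), m.group(10))
```

Running this prints `Fargo-Wahpeton, ND-MN | 61.30 98.52 433.23` for the first row and `TX | 43.38 74.61 380.82` for the second, confirming the comma- and space-containing area name stays in a single group.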