I have a description string and, for each token in it, a corresponding tag that specifies the token's type. I want to extract the address out of each description, i.e. all tokens tagged <street>, <city>, and <state>. How can I achieve this in PySpark?
+-------------------------------------------+--------------------------------------------------------------------------+
|description                                |tags                                                                      |
+-------------------------------------------+--------------------------------------------------------------------------+
|"aci*credit one bank, n"                   |<vendor_name> <vendor_name> <vendor_name> <vendor_name>                   |
|odot dmv2u 503-9455400 or 06/30            |<vendor_name> <vendor_name> <phone_number> <state> <trans_date>           |
|# 7-eleven 41066 5050 hunter rd ooltewah tn|<other> <vendor_name> <store_id> <street> <street> <street> <city> <state>|
+-------------------------------------------+--------------------------------------------------------------------------+
The output I am looking for is either

NULL

when a description contains no address tokens, or the address tokens joined together, e.g. for the last row:

5050 hunter rd ooltewah tn

Anything that does not carry an address tag should not be included.
Check out this solution:
import pyspark.sql.functions as f

df = spark.createDataFrame([
    ('"aci*credit one bank, n"', '<vendor_name> <vendor_name> <vendor_name> <vendor_name>'),
    ('odot dmv2u 503-9455400 or 06/30', '<vendor_name> <vendor_name> <phone_number> <state> <trans_date>'),
    ('# 7-eleven 41066 5050 hunter rd ooltewah tn', '<other> <vendor_name> <store_id> <street> <street> <street> <city> <state>')
], ['description', 'tags'])

# Tags whose tokens make up the address.
address_tags = ['<state>', '<street>', '<city>']
# Render the list as a SQL literal list: "<state>","<street>","<city>"
address_tags_concatenated = '"' + '","'.join(address_tags) + '"'

df = (
    df
    # Zip the token and tag arrays into one array of (description, tag) structs.
    # A map can't be used here because tag values repeat within a row.
    .withColumn('content_zip', f.arrays_zip(
        f.split(f.col('description'), ' ').alias('description'),
        f.split(f.col('tags'), ' ').alias('tag')))
    # Keep only the structs whose tag is one of the address tags.
    .withColumn('content_zip_filtered',
                f.expr(f'filter(content_zip, x -> x.tag in ({address_tags_concatenated}))'))
    # Re-join the surviving tokens into a single address string.
    .select(f.concat_ws(' ', f.col('content_zip_filtered.description')).alias('address'))
)

df.show(truncate=False)
And the output:
+--------------------------+
|address |
+--------------------------+
| |
|or |
|5050 hunter rd ooltewah tn|
+--------------------------+
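The per-row logic that the Spark expression performs can be sketched in plain Python. This is just an illustration of the same zip-and-filter idea (extract_address is a hypothetical helper, not part of the solution above); it also shows one way to turn the empty result for non-address rows into None, matching the NULL the question asked for:

```python
def extract_address(description, tags,
                    address_tags=('<state>', '<street>', '<city>')):
    """Zip tokens with their tags, keep only address-tagged tokens,
    join them, and return None when no address token is present."""
    pairs = zip(description.split(' '), tags.split(' '))
    address = ' '.join(tok for tok, tag in pairs if tag in address_tags)
    return address or None  # empty string -> None (NULL)

# The third sample row keeps only the street/city/state tokens:
extract_address(
    '# 7-eleven 41066 5050 hunter rd ooltewah tn',
    '<other> <vendor_name> <store_id> <street> <street> <street> <city> <state>')
# -> '5050 hunter rd ooltewah tn'
```

In Spark itself, the empty string in the first output row can be mapped to NULL the same way, e.g. with f.when(f.col('address') == '', None).otherwise(f.col('address')). Note that the second row yields "or" rather than NULL because that token really is tagged <state> in the input.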