
Extract an address from text in PySpark


I have some descriptions, along with a tag for each token in the description. Each tag specifies the type of its token. I want to extract the address from each description, that is, all tokens tagged <street>, <city>, or <state>. How can I achieve this in PySpark?

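+-------------------------------------------+--------------------------------------------------------------------------+
|description                                |tags                                                                      |
+-------------------------------------------+--------------------------------------------------------------------------+
|"aci*credit one bank, n"                   |<vendor_name> <vendor_name> <vendor_name> <vendor_name>                   |
|odot dmv2u 503-9455400 or 06/30            |<vendor_name> <vendor_name> <phone_number> <state> <trans_date>           |
|# 7-eleven 41066 5050 hunter rd ooltewah tn|<other> <vendor_name> <store_id> <street> <street> <street> <city> <state>|
+-------------------------------------------+--------------------------------------------------------------------------+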

The output I am looking for, one row per description, is:

NULL
OR
5050 hunter rd ooltewah tn

Any token that is not tagged as part of an address should be excluded.


Solution

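  • One approach is to zip the tokens with their tags, keep only the pairs that carry an address tag, and join the remaining tokens: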

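    import pyspark.sql.functions as f
    from pyspark.sql import SparkSession

    # Reuse the active session (as in a notebook or the pyspark shell) or start one.
    spark = SparkSession.builder.getOrCreate()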
    
    df = spark.createDataFrame([
        ('"aci*credit one bank, n"', '<vendor_name> <vendor_name> <vendor_name> <vendor_name>'),
        ('odot dmv2u 503-9455400 or 06/30', '<vendor_name> <vendor_name> <phone_number> <state> <trans_date>'),
        ('# 7-eleven 41066 5050 hunter rd ooltewah tn', '<other> <vendor_name> <store_id> <street> <street> <street> <city> <state>')
    ], ['description', 'tags'])
    
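    address_tags = ['<state>', '<street>', '<city>']
    # Build a quoted, comma-separated list for the SQL IN clause below.
    address_tags_concatenated = '"' + '","'.join(address_tags) + '"'
    df = (
        df
        # Pair each token with its tag by position. A map keyed by tag
        # would not work because tag values repeat.
        .withColumn('content_zip', f.arrays_zip(
            f.split(f.col('description'), ' ').alias('description'),
            f.split(f.col('tags'), ' ').alias('tag')))
        # Keep only the pairs whose tag is an address tag.
        .withColumn('content_zip_filtered', f.expr(f'filter(content_zip, x -> x.tag in ({address_tags_concatenated}))'))
        # Join the surviving tokens into one string; rows without any
        # address tokens yield an empty string.
        .select(f.concat_ws(" ", f.col('content_zip_filtered.description')).alias('address'))
    )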
    
    df.show(truncate=False)
    

    And the output:

    +--------------------------+
    |address                   |
    +--------------------------+
    |                          |
    |or                        |
    |5050 hunter rd ooltewah tn|
    +--------------------------+
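
The per-row logic above can also be sketched in plain Python, which makes the zip-and-filter step easier to follow and to test without a Spark session. The function name `extract_address` and the constant `ADDRESS_TAGS` are hypothetical names chosen for this illustration, and the sketch assumes the description and the tags split into the same number of space-separated tokens. Unlike the Spark version, it returns None (rather than an empty string) when no token carries an address tag, which is closer to the NULL the question asks for.

```python
from typing import Optional

# Tags treated as parts of an address (same three as in the answer above).
ADDRESS_TAGS = {"<street>", "<city>", "<state>"}


def extract_address(description: str, tags: str) -> Optional[str]:
    """Return the address tokens joined by spaces, or None if there are none."""
    tokens = description.split(" ")
    tag_list = tags.split(" ")
    # Pair token i with tag i. Assumes both lists have the same length;
    # zip() silently drops any unmatched trailing items.
    kept = [token for token, tag in zip(tokens, tag_list) if tag in ADDRESS_TAGS]
    return " ".join(kept) if kept else None


rows = [
    ('"aci*credit one bank, n"',
     "<vendor_name> <vendor_name> <vendor_name> <vendor_name>"),
    ("odot dmv2u 503-9455400 or 06/30",
     "<vendor_name> <vendor_name> <phone_number> <state> <trans_date>"),
    ("# 7-eleven 41066 5050 hunter rd ooltewah tn",
     "<other> <vendor_name> <store_id> <street> <street> <street> <city> <state>"),
]

for description, tags in rows:
    print(extract_address(description, tags))
```

To get NULL instead of an empty string in the Spark pipeline itself, one option would be to wrap the final column in `f.when(f.col('address') != '', f.col('address'))`, since `when` without `otherwise` yields NULL for rows that do not match.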