
Extract specific dictionary value from dataframe in PySpark


I have the below dataframe:

dataDictionary = [('value1', [{'key': 'Fruit', 'value': 'Apple'}, {'key': 'Colour', 'value': 'White'}]), 
                 ('value2', [{'key': 'Fruit', 'value': 'Mango'}, {'key': 'Bird', 'value': 'Eagle'}, {'key': 'Colour', 'value': 'Black'}])] 

df = spark.createDataFrame(data=dataDictionary)
df.printSchema()
df.show(truncate=False)
+------+------------------------------------------------------------------------------------------------+
|_1    |_2                                                                                              |
+------+------------------------------------------------------------------------------------------------+
|value1|[{value -> Apple, key -> Fruit}, {value -> White, key -> Colour}]                               |
|value2|[{value -> Mango, key -> Fruit}, {value -> Eagle, key -> Bird}, {value -> Black, key -> Colour}]|
+------+------------------------------------------------------------------------------------------------+
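
For reference, printSchema() (called above) would report roughly the following, showing that _2 is inferred as an array of maps rather than a string:

root
 |-- _1: string (nullable = true)
 |-- _2: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)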

I want to extract only the values for key -> Colour. The result should be:

White
Black

I have tried multiple options using regexp_extract_all as well as substring with instr, but the result is always empty. Any suggestion would be appreciated.

result = spark.sql("""select
                    regexp_extract('_2', '''key': 'Colour' + '(\\w+)') as value
                    from table 
                    """)

Solution

  • It is preferable to use PySpark's built-in functions, which offer better performance and convenience.

    First, use the filter function to keep only the elements whose key is Colour in the _2 column (an array of maps), then take the first (index 0) element of the filtered array, and finally look up the entry whose key is value in that map. (The regexp attempts fail because _2 is an array of maps rather than a string; quoting '_2' in the query also makes it a string literal instead of a column reference.)

    from pyspark.sql import SparkSession, functions as F
    
    ...
    # Keep the entries whose 'key' is 'Colour', take the first match,
    # then pull out its 'value' entry (F.filter requires Spark 3.1+)
    df = df.select('_1', F.filter('_2', lambda x: x['key'] == 'Colour')[0]['value'])
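
  • If you want to stay with the spark.sql approach from the question, the same filter higher-order function is available in SQL (since Spark 2.4), so an equivalent query could look like the sketch below (the view name table is assumed, as in the question):

    # Register the dataframe so it can be queried with spark.sql
    df.createOrReplaceTempView('table')
    result = spark.sql("""
        select _1,
               filter(_2, x -> x['key'] = 'Colour')[0]['value'] as value
        from table
    """)
    result.show(truncate=False)
    # +------+-----+
    # |_1    |value|
    # +------+-----+
    # |value1|White|
    # |value2|Black|
    # +------+-----+

    With default (non-ANSI) settings, rows that have no Colour entry come back as null, since indexing an empty array yields null.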