Search code examples
apache-sparkpysparkapache-spark-sqlsplitregexp-replace

How to get first string value with numbers in pyspark array


I want to extract just the first numeric instance from the languages field as a date in another column of the pyspark dataframe.

Sample data

data = [
 ("James","Java_Scala_C++_20230510_2023051345"),
 ("Mindy", "Spark_Java_20211014_20211014255_C++"),
 ("Julia", "CSharp_20200115_VB")
]

from pyspark.sql.types import StringType, ArrayType,StructType,StructField
schema = StructType([ 
    StructField("name",StringType(),True), 
    StructField("languages",StringType(),True)
  ])

df = spark.createDataFrame(data=data,schema=schema)
df.display()

By using split on the column, I can split the field into an array with what I'm looking for. I can use to_date to convert the string to a date, but would like help selecting the first instance of the numeric field without hardcoding an index which wouldn't work since the number values are in different indexes. I tried regexp_extract which doesn't work with arrays.

from pyspark.sql.functions import *
df = df\
.withColumn('languages_split', split(col('languages'), '_'))
df.display()

Desired output two columns with the following values. String names and dates.

James: 20230510
Mindy: 20211014
Julia: 20200115

Solution

  • Try with regexp_extract with the 1 capture group.

    _(\d{1,8})_ -> capture the group starts with _ and get 8 digits and end with _.

    Example:

    data = [
     ("James","Java_Scala_C++_20230510_2023051345"),
     ("Mindy", "Spark_Java_20211014_20211014255_C++"),
     ("Julia", "CSharp_20200115_VB")
    ]
    
    from pyspark.sql.types import StringType, ArrayType,StructType,StructField
    schema = StructType([ 
        StructField("name",StringType(),True), 
        StructField("languages",StringType(),True)
      ])
    
    df = spark.createDataFrame(data=data,schema=schema)
    
    from pyspark.sql.functions import *
    df = df\
    .withColumn('languages_split', regexp_extract(col("languages"),"_(\d{1,8})_",1))
    df.show(10,False)
    #+-----+-----------------------------------+---------------+
    #|name |languages                          |languages_split|
    #+-----+-----------------------------------+---------------+
    #|James|Java_Scala_C++_20230510_2023051345 |20230510       |
    #|Mindy|Spark_Java_20211014_20211014255_C++|20211014       | 
    #|Julia|CSharp_20200115_VB                 |20200115       |
    #+-----+-----------------------------------+---------------+