I want to extract just the first numeric instance from the languages field as a date in another column of the pyspark dataframe.
Sample data
data = [
("James","Java_Scala_C++_20230510_2023051345"),
("Mindy", "Spark_Java_20211014_20211014255_C++"),
("Julia", "CSharp_20200115_VB")
]
from pyspark.sql.types import StringType, ArrayType,StructType,StructField
schema = StructType([
StructField("name",StringType(),True),
StructField("languages",StringType(),True)
])
df = spark.createDataFrame(data=data,schema=schema)
df.display()
By using split on the column, I can split the field into an array with what I'm looking for. I can use to_date to convert the string to a date, but would like help selecting the first instance of the numeric field without hardcoding an index which wouldn't work since the number values are in different indexes. I tried regexp_extract which doesn't work with arrays.
from pyspark.sql.functions import *
df = df\
.withColumn('languages_split', split(col('languages'), '_'))
df.display()
Desired output two columns with the following values. String names and dates.
James: 20230510
Mindy: 20211014
Julia: 20200115
Try with regexp_extract
with the 1
capture group.
_(\d{1,8})_
-> capture the group starts with _
and get 8
digits and end with _
.
Example:
data = [
("James","Java_Scala_C++_20230510_2023051345"),
("Mindy", "Spark_Java_20211014_20211014255_C++"),
("Julia", "CSharp_20200115_VB")
]
from pyspark.sql.types import StringType, ArrayType,StructType,StructField
schema = StructType([
StructField("name",StringType(),True),
StructField("languages",StringType(),True)
])
df = spark.createDataFrame(data=data,schema=schema)
from pyspark.sql.functions import *
df = df\
.withColumn('languages_split', regexp_extract(col("languages"),"_(\d{1,8})_",1))
df.show(10,False)
#+-----+-----------------------------------+---------------+
#|name |languages |languages_split|
#+-----+-----------------------------------+---------------+
#|James|Java_Scala_C++_20230510_2023051345 |20230510 |
#|Mindy|Spark_Java_20211014_20211014255_C++|20211014 |
#|Julia|CSharp_20200115_VB |20200115 |
#+-----+-----------------------------------+---------------+