Tags: regex, apache-spark, pyspark, apache-spark-sql, regex-group

How to use multiple regex groups in pyspark?


I want to insert a symbol between two regex groups.

My code is as follows:

from pyspark.sql.functions import concat, lit, regexp_extract

df = spark.createDataFrame([('ab',)], ['str'])
result = df.select(
    concat(
        regexp_extract('str', r'(\w)(\w)', 1),  # extract the first group
        lit(' '),                               # insert the symbol
        regexp_extract('str', r'(\w)(\w)', 2)   # extract the second group
    ).alias('d')
).collect()
print(result)

Is there any better way?


Solution

  • You can use regexp_replace with backreferences to the capture groups, so the pattern only has to be written once:

    import pyspark.sql.functions as F
    
    df.select(F.regexp_replace('str', r'(\w)(\w)', '$1 $2').alias('d')).show()
    +---+
    |  d|
    +---+
    |a b|
    +---+
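    Note that Spark's regexp_replace follows Java regex syntax, so backreferences in the replacement string are written $1, $2, whereas Python's re module writes them as \1, \2. As a minimal local sketch (no Spark session required), the same substitution with re.sub looks like this:

    ```python
    import re

    # Same idea as the Spark expression above, but in plain Python:
    # capture two word characters and re-emit them with a space between.
    # Python backreferences use \1, \2 instead of Java/Spark's $1, $2.
    def space_between(s: str) -> str:
        return re.sub(r'(\w)(\w)', r'\1 \2', s)

    print(space_between('ab'))  # a b
    ```

    Like regexp_replace, re.sub replaces every non-overlapping match, not just the first one.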