regex pyspark extract

PySpark regexp_extract: extract only the number from a text string that also contains special characters


I am trying to extract only the numbers from a freeText column whose values look like DH-09878877ABC, 9009898DEC, or qwert9876788plk.

I just want to extract the numbers using the PySpark code below, but it's not working. Please advise.

df = df.withColumn("acount_nbr", regexp_extract(df['freeText'], r'(^[0-9])', 1))

Thanks


Solution

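  • The pattern (^[0-9]) matches only a single digit at the very start of the string, so it returns an empty string for values that begin with letters and just the first digit otherwise. If you just want to extract the numbers, and the input contains at most one run of digits, drop the ^ anchor and use the pattern [0-9]+: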

    from pyspark.sql.functions import regexp_extract
    df = df.withColumn("acount_nbr", regexp_extract(df['freeText'], r'([0-9]+)', 1))
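
To see the difference between the two patterns without a Spark session, here is a minimal plain-Python sketch that runs both against the three sample values from the question. The helper extract_group is hypothetical and is not part of PySpark. It only imitates the regexp_extract behaviour of returning the requested group of the first match, or an empty string when nothing matches. It uses Python's re module rather than Spark's Java regex engine, which should behave the same for these simple patterns.

```python
import re

def extract_group(text, pattern, idx=1):
    """Hypothetical stand-in for regexp_extract: group `idx` of the
    first match, or an empty string when the pattern does not match."""
    m = re.search(pattern, text)
    return m.group(idx) if m else ""

# Sample values taken from the question.
samples = ["DH-09878877ABC", "9009898DEC", "qwert9876788plk"]

# Original pattern: one digit, and only at the start of the string.
anchored = [extract_group(s, r"(^[0-9])") for s in samples]

# Suggested pattern: the first run of one or more digits, anywhere.
unanchored = [extract_group(s, r"([0-9]+)") for s in samples]

print(anchored)    # ['', '9', '']
print(unanchored)  # ['09878877', '9009898', '9876788']
```

Note that the extracted value is a string, so a leading zero (as in 09878877) is preserved. If a value could contain several separate runs of digits, this pattern returns only the first one.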