Tags: python, numbers, pyspark, strip, apache-spark-sql

strip numbers from pyspark dataframe column of type string


I'm working with a dataframe in PySpark. I have a dataframe df with a column col_1 of array type, and the arrays contain numbers as well as words.

Is there a built-in function to remove the numbers from these strings?

Dataframe schema:

>>> df.printSchema()
root
 |-- col_1: array (nullable = true)
 |    |-- element: string (containsNull = true)

Sample Values in Column:

>>> df.select("col_1").show(2, truncate=False)

+-------------------------------------------------------------------------+
|col_1                                                                    |
+-------------------------------------------------------------------------+
|[use, bal, trans, ck, pay, billor, trans, cc, balances, got, grat, thnxs]|
|[hello, like, farther, lower, apr, 11, 49, thank]                        |
+-------------------------------------------------------------------------+

In this case, I'm looking for a function that would strip the numbers 11 and 49 from the second row. Thank you.


Solution

  • Here is something you can try:

    # Data preparation
    data = [[['use', 'bal', 'trans', 'ck', 'pay', 'billor', 'trans', 'cc', 'balances', 'got', 'grat', 'thnxs']],
            [['hello', 'like', 'farther', 'lower', 'apr', '11', '49', 'thank']]]
    
    df = sc.parallelize(data).toDF(["arr"])
    df.printSchema()
    

    Output:

    root
     |-- arr: array (nullable = true)
     |    |-- element: string (containsNull = true)
    

    Now explode the array, extract the digits from each element, and keep only the non-empty matches:

    from pyspark.sql.functions import explode, regexp_extract, col

    # Use a raw string for the regex so '\d' is not treated as an escape sequence
    df.select(explode(df.arr).alias('elements')) \
      .select(regexp_extract('elements', r'\d+', 0) \
      .alias('Numbers')) \
      .filter(col('Numbers') != '').show()
    

    Output:

    +-------+
    |Numbers|
    +-------+
    |     11|
    |     49|
    +-------+