Tags: python, apache-spark, pyspark, apache-spark-sql, numeric

How to check if a string column in a PySpark DataFrame is all numeric


I have a PySpark DataFrame with a column of strings. How can I check which rows in it are numeric? I could not find a function for this in PySpark's official documentation.

values = [('25q36',),('75647',),('13864',),('8758K',),('07645',)]
df = sqlContext.createDataFrame(values,['ID',])
df.show()
+-----+
|   ID|
+-----+
|25q36|
|75647|
|13864|
|8758K|
|07645|
+-----+

In Python, strings have an .isdigit() method that returns True if the string contains only digits and False otherwise.
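
For reference, the plain-Python behaviour I want to reproduce looks like this:

"75647".isdigit()  # True: every character is a digit
"25q36".isdigit()  # False: contains a letter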

Expected DataFrame:

+-----+-------+
|   ID| Value |
+-----+-------+
|25q36| False |
|75647| True  |
|13864| True  |
|8758K| False |
|07645| True  |
+-----+-------+

I would like to avoid creating a UDF.


Solution

  • A simple cast would do the job:

    from pyspark.sql import functions as F
    
    # A non-null result from the int cast means the string parsed as a number
    df.select(
      "ID",
      F.col("ID").cast("int").isNotNull().alias("Value")
    ).show()
    
    +-----+-----+
    |   ID|Value|
    +-----+-----+
    |25q36|false|
    |75647| true|
    |13864| true|
    |8758K|false|
    |07645| true|
    +-----+-----+
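
  • If you want a check that mirrors Python's str.isdigit() more closely, a regex via rlike also avoids a UDF. Note that the int cast above treats signed values such as "-5" as numeric and, depending on the Spark version, may return null for digit strings too long to fit in an int, so the two approaches can disagree on edge cases. A minimal sketch, assuming the same df as above:

    from pyspark.sql import functions as F
    
    df.select(
      "ID",
      # true only when the string consists solely of the digits 0-9
      F.col("ID").rlike("^[0-9]+$").alias("Value")
    ).show()

    For the sample data this produces the same result as the expected DataFrame above.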