apache-spark, pyspark, apache-spark-1.6

Removing NULL, NaN, and empty space from a PySpark DataFrame


I have a DataFrame in PySpark that contains empty spaces, NULLs, and NaNs. I want to remove any row that contains one of those. I tried the commands below, but nothing seems to work.

myDF.na.drop().show()
myDF.na.drop(how='any').show()

Below is the DataFrame:

+---+----------+----------+-----+-----+
|age|  category|      date|empId| name|
+---+----------+----------+-----+-----+
| 25|electronic|17-01-2018|  101|  abc|
| 24|    sports|16-01-2018|  102|  def|
| 23|electronic|17-01-2018|  103|  hhh|
| 23|electronic|16-01-2018|  104|  yyy|
| 29|       men|12-01-2018|  105| ajay|
| 31|      kids|17-01-2018|  106|vijay|
|   |       Men|       nan|  107|Sumit|
+---+----------+----------+-----+-----+

What am I missing? What is the best way to handle NULL, NaN, or empty spaces so that they don't cause problems in the actual calculations?


Solution

  • NaN ("not a number") has a different meaning than NULL, and an empty string is just a regular value (a CSV reader can convert it to NULL automatically), so na.drop won't match all of these.

    You can convert all of them to NULL and then drop the rows:

    from pyspark.sql.functions import col, isnan, when, trim
    
    df = spark.createDataFrame([
        ("", 1, 2.0), ("foo", None, 3.0), ("bar", 1, float("NaN")), 
        ("good", 42, 42.0)])
    
    def to_null(c):
        # Keep the value only if it is not NULL, not NaN, and not an
        # empty/whitespace-only string; otherwise when() yields NULL.
        return when(~(col(c).isNull() | isnan(col(c)) | (trim(col(c)) == "")), col(c))
    
    # Rewrite every column through to_null, then drop rows containing any NULL.
    df.select([to_null(c).alias(c) for c in df.columns]).na.drop().show()
    # +----+---+----+
    # |  _1| _2|  _3|
    # +----+---+----+
    # |good| 42|42.0|
    # +----+---+----+
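
    Equivalently, you can keep the columns as they are and filter the rows directly with the same predicate. A minimal sketch, assuming the same df as above:

    from functools import reduce
    from pyspark.sql.functions import col, isnan, trim
    
    # One "is clean" test per column: not NULL, not NaN, and not blank.
    is_clean = [
        ~(col(c).isNull() | isnan(col(c)) | (trim(col(c)) == ""))
        for c in df.columns
    ]
    
    # AND the per-column tests together; keep only rows that pass all of them.
    df.filter(reduce(lambda a, b: a & b, is_clean)).show()
    # +----+---+----+
    # |  _1| _2|  _3|
    # +----+---+----+
    # |good| 42|42.0|
    # +----+---+----+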
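
    As noted above, a CSV reader can do part of this conversion while loading. A sketch assuming Spark 2.x's built-in csv source and a hypothetical file name; nullValue and nanValue are real reader options, but nanValue only applies to floating-point columns, so a string column holding the literal "nan" (like the date column in the question) still needs the to_null pass:

    # "data.csv" is a hypothetical path; nullValue/nanValue are standard options.
    df = spark.read.csv(
        "data.csv",
        header=True,
        inferSchema=True,
        nullValue="",    # read empty fields as NULL
        nanValue="nan",  # parse the literal "nan" as NaN in float/double columns
    )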