python dataframe scala apache-spark dataset

Getting a different number of rows in Python and Spark Scala - dataframe


I'm trying to drop the rows that have null values in some columns of a dataframe, but I get a different number of rows in Python and in Scala.

I did the same thing in both. In Python I get 2127178 rows and in Scala I get 8723 rows.

For example, in Python I did:

dfplaneairport.dropna(subset=["model"], inplace= True)
dfplaneairport.dropna(subset=["engine_type"], inplace= True)
dfplaneairport.dropna(subset=["aircraft_type"], inplace= True)
dfplaneairport.dropna(subset=["status"], inplace= True)
dfplaneairport.dropna(subset=["ArrDelay"], inplace= True)
dfplaneairport.dropna(subset=["issue_date"], inplace= True)
dfplaneairport.dropna(subset=["manufacturer"], inplace= True)
dfplaneairport.dropna(subset=["type"], inplace= True)
dfplaneairport.dropna(subset=["tailnum"], inplace= True)
dfplaneairport.dropna(subset=["DepDelay"], inplace= True)
dfplaneairport.dropna(subset=["TaxiOut"], inplace= True)

dfplaneairport.shape
(2127178, 32)

and in Spark Scala I did:

dfairports = dfairports.na.drop(Seq("engine_type", "aircraft_type", "status", "model", "issue_date", "manufacturer", "type","ArrDelay", "DepDelay", "TaxiOut", "tailnum"))

dfairports.count()
8723

I am expecting the same number of rows and I don't know what I'm doing wrong.


Solution

  • You seem to be using the pandas dropna function, not the PySpark one. Notice that you pass the inplace argument, which exists in pandas' DataFrame.dropna but not in the PySpark function. So you are comparing a pandas result against a Spark result, not two Spark results.

    Here are two snippets (in Scala and in PySpark) that behave exactly the same way.

    Scala:

    import spark.implicits._
    
    val df = Seq(
      ("James",null,"Smith","36636","M",3000), ("Michael","Rose",null,"40288","M",4000),
      ("Robert",null,"Williams","42114","M",4000),
      ("Maria","Anne","Jones","39192","F",4000),
      ("Jen","Mary","Brown",null,"F",-1)
    ).toDF("firstname", "middlename", "lastname", "id", "gender", "salary")
    df.show
    +---------+----------+--------+-----+------+------+
    |firstname|middlename|lastname|   id|gender|salary|
    +---------+----------+--------+-----+------+------+
    |    James|      null|   Smith|36636|     M|  3000|
    |  Michael|      Rose|    null|40288|     M|  4000|
    |   Robert|      null|Williams|42114|     M|  4000|
    |    Maria|      Anne|   Jones|39192|     F|  4000|
    |      Jen|      Mary|   Brown| null|     F|    -1|
    +---------+----------+--------+-----+------+------+
    
    df.na.drop(Seq("middlename", "lastname")).show
    +---------+----------+--------+-----+------+------+
    |firstname|middlename|lastname|   id|gender|salary|
    +---------+----------+--------+-----+------+------+
    |    Maria|      Anne|   Jones|39192|     F|  4000|
    |      Jen|      Mary|   Brown| null|     F|    -1|
    +---------+----------+--------+-----+------+------+
    

    PySpark:

    data = [("James",None,"Smith","36636","M",3000), ("Michael","Rose",None,"40288","M",4000),
        ("Robert",None,"Williams","42114","M",4000),
        ("Maria","Anne","Jones","39192","F",4000),
        ("Jen","Mary","Brown",None,"F",-1)
      ]
    
    df = spark.createDataFrame(data, ["firstname", "middlename", "lastname", "id", "gender", "salary"])
    
    df.show()
    +---------+----------+--------+-----+------+------+
    |firstname|middlename|lastname|   id|gender|salary|
    +---------+----------+--------+-----+------+------+
    |    James|      null|   Smith|36636|     M|  3000|
    |  Michael|      Rose|    null|40288|     M|  4000|
    |   Robert|      null|Williams|42114|     M|  4000|
    |    Maria|      Anne|   Jones|39192|     F|  4000|
    |      Jen|      Mary|   Brown| null|     F|    -1|
    +---------+----------+--------+-----+------+------+
    
    df.dropna(subset=["middlename", "lastname"]).show()
    +---------+----------+--------+-----+------+------+
    |firstname|middlename|lastname|   id|gender|salary|
    +---------+----------+--------+-----+------+------+
    |    Maria|      Anne|   Jones|39192|     F|  4000|
    |      Jen|      Mary|   Brown| null|     F|    -1|
    +---------+----------+--------+-----+------+------+
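
For comparison, here is a minimal pandas sketch on the same sample rows. It is only an illustration: the names pdf, single_call, chained and null_counts are made up for this example and are not from the question. It suggests that chaining one-column dropna calls with inplace=True, as in the question, keeps the same rows as one call that lists all the columns, so the drop logic itself matches the Spark snippets above. If the counts still differ on the real data, comparing the per-column null counts of the two inputs may be a reasonable next check.

```python
import pandas as pd

columns = ["firstname", "middlename", "lastname", "id", "gender", "salary"]
data = [
    ("James", None, "Smith", "36636", "M", 3000),
    ("Michael", "Rose", None, "40288", "M", 4000),
    ("Robert", None, "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", None, "F", -1),
]
pdf = pd.DataFrame(data, columns=columns)

# One call listing several columns: a row is dropped if ANY of them is null.
single_call = pdf.dropna(subset=["middlename", "lastname"])

# Chained one-column calls with inplace=True, in the style of the question.
chained = pdf.copy()
chained.dropna(subset=["middlename"], inplace=True)
chained.dropna(subset=["lastname"], inplace=True)

# Per-column null counts, useful for comparing the inputs on each side.
null_counts = pdf.isna().sum()

print(single_call["firstname"].tolist())  # ['Maria', 'Jen']
print(chained.equals(single_call))        # True
print(null_counts.to_dict())
```

Maria and Jen are the same two rows that the Scala and PySpark snippets keep.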