Tags: sql, apache-spark, pyspark, null, apache-spark-sql

Comparison operator in PySpark (not equal/ !=)


I am trying to obtain all rows in a dataframe where two flags are set to '1', and then all rows where only one of the two flags is set to '1' and the other is NOT equal to '1'.

With the following schema (three columns),

df = sqlContext.createDataFrame(
    # None produces real nulls, matching the output shown below
    [('a', 1, None), ('b', 1, 1), ('c', 1, None), ('d', None, 1), ('e', 1, 1)],
    # extra rows experimented with: ('f', 1, 'NaN'), ('g', 'bla', 1)
    schema=('id', 'foo', 'bar')
)

I obtain the following dataframe:

+---+----+----+
| id| foo| bar|
+---+----+----+
|  a|   1|null|
|  b|   1|   1|
|  c|   1|null|
|  d|null|   1|
|  e|   1|   1|
+---+----+----+

When I apply the desired filters, the first one (foo=1 AND bar=1) works, but the second (foo=1 AND NOT bar=1) does not.

foobar_df = df.filter( (df.foo==1) & (df.bar==1) )

yields:

+---+---+---+
| id|foo|bar|
+---+---+---+
|  b|  1|  1|
|  e|  1|  1|
+---+---+---+

Here is the filter that does not behave as expected:

foo_df = df.filter( (df.foo==1) & (df.bar!=1) )
foo_df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
+---+---+---+

Why is it not filtering? How can I get the rows where only foo is equal to '1'?


Solution

  • To filter null values try:

    foo_df = df.filter( (df.foo==1) & (df.bar.isNull()) )

    https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull
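
  • Why the != filter returns nothing: comparing null to anything with == or != evaluates to null rather than True or False, and filter() only keeps rows where the predicate is True, so the null rows are dropped. If you want a single condition covering both "bar differs from 1" and "bar is null", a sketch along these lines should work; note that eqNullSafe assumes a newer Spark (roughly 2.3+) and is not part of the 1.6 API linked above:

    # Version-agnostic: treat "not 1" as "different from 1 OR null"
    foo_df = df.filter( (df.foo==1) & ((df.bar!=1) | df.bar.isNull()) )

    # Spark 2.3+: null-safe equality, negated
    foo_df = df.filter( (df.foo==1) & ~df.bar.eqNullSafe(1) )

    foo_df.show()
    # +---+---+----+
    # | id|foo| bar|
    # +---+---+----+
    # |  a|  1|null|
    # |  c|  1|null|
    # +---+---+----+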