I have a DataFrame df in PySpark, like the one shown below:
+-----+--------------------+-------+
|   ID|           customers|country|
+-----+--------------------+-------+
|56   |xyz Limited         |U.K.   |
|66   |ABC Limited         |U.K.   |
|16   |Sons & Sons         |U.K.   |
|51   |TÜV GmbH            |Germany|
|23   |Mueller GmbH        |Germany|
|97   |Schneider AG        |Germany|
|69   |Sahm UG             |Austria|
+-----+--------------------+-------+
I would like to keep only those rows where ID starts with either 5 or 6. So, I want my final DataFrame to look like this:
+-----+--------------------+-------+
|   ID|           customers|country|
+-----+--------------------+-------+
|56   |xyz Limited         |U.K.   |
|66   |ABC Limited         |U.K.   |
|51   |TÜV GmbH            |Germany|
|69   |Sahm UG             |Austria|
+-----+--------------------+-------+
This can be achieved in many ways, and that's not a problem. But I am interested in learning how this can be done using a LIKE statement.
Had I only been interested in those rows where ID starts with 5, it could have been done easily like this:
df=df.where("ID like ('5%')")
My question: how can I add a second condition such as "ID like ('6%')" with a boolean OR (|) inside the where clause? I want to do something like the code shown below, but it gives an error. In a nutshell, how can I combine multiple boolean conditions using LIKE and .where here?
df=df.where("(ID like ('5%')) | (ID like ('6%'))")
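You can try the SQL keyword or instead of |. The string passed to where is parsed as a Spark SQL expression, and there | is the bitwise OR operator rather than the logical one, so it cannot combine two boolean LIKE conditions: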
df = df.where('ID like "5%" or ID like "6%"')
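If the list of prefixes grows, the same kind of expression string could be built programmatically rather than typed by hand. This is only a minimal sketch: build_like_filter is a hypothetical helper (not part of PySpark), and it assumes the column name is trusted and the prefixes contain no quotes or LIKE wildcards (% or _).

```python
def build_like_filter(column, prefixes):
    """Return a Spark SQL filter string matching any of the given prefixes.

    Hypothetical helper: assumes `column` is a trusted column name and the
    prefixes contain no quotes or LIKE wildcards (% or _).
    """
    # One "column like 'prefix%'" clause per prefix, joined with SQL's `or`.
    clauses = [f"{column} like '{prefix}%'" for prefix in prefixes]
    return " or ".join(clauses)


condition = build_like_filter("ID", ["5", "6"])
print(condition)  # ID like '5%' or ID like '6%'
```

The resulting string can then be passed to where in the same way as the hand-written one, for example df.where(condition).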