I'm trying to filter a table in PySpark so that I keep only the rows where the first two characters of one column's values are uppercase letters, such as 'UTrecht', 'NEw York', etc.
This is what I've tried, but it failed:
df_filtered=df.filter(F.col("column_name").isUpper())
I've also tried:
df_filtered=df.filter(str.isupper,"affiliation_city")
which resulted in the following error:
filter() takes 2 positional arguments but 3 were given
I'm a relative newbie to PySpark, so any advice would be appreciated.
Thanks in advance!
You can check whether the prefix is uppercase by combining the substr and upper functions. Take the first two characters with F.col("affiliation_city").substr(1, 2), compare them with their uppercase version, F.upper(F.col("affiliation_city").substr(1, 2)), and filter on the equality:
from pyspark.sql import functions as F

df.filter(
    F.col("affiliation_city").substr(1, 2) == F.upper(F.col("affiliation_city").substr(1, 2))
).show()
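For example, here is a minimal, self-contained sketch (the toy data and the local SparkSession are my own, just to illustrate):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy data: only the first two rows start with two uppercase letters
df = spark.createDataFrame(
    [("UTrecht",), ("NEw York",), ("Utrecht",), ("amsterdam",)],
    ["affiliation_city"],
)

df.filter(
    F.col("affiliation_city").substr(1, 2)
    == F.upper(F.col("affiliation_city").substr(1, 2))
).show()
# Keeps 'UTrecht' and 'NEw York'; drops 'Utrecht' and 'amsterdam'

One caveat: upper() leaves digits and punctuation unchanged, so a value like '12 Main' would also pass this filter.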
If you want more flexibility (though it's not the best way to do it, since a Python UDF is slower than the built-in functions), you can use a udf that returns True/False with Python's isupper() method:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# True when the first two characters are uppercase, False for NULL values
upp = udf(lambda x: x[:2].isupper() if x is not None else False, BooleanType())

df.filter(upp("affiliation_city")).show()
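As a side note (this alternative is my own suggestion, not something you tried), a regular expression via the built-in rlike method avoids both the UDF and the digit/punctuation caveat, because it matches uppercase letters explicitly:

# '^[A-Z]{2}' requires two uppercase ASCII letters at the start of the string;
# widen the character class if your city names can start with accented letters
df.filter(F.col("affiliation_city").rlike("^[A-Z]{2}")).show()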