Tags: python, apache-spark, pyspark, filtering, uppercase

Filter on column values whose first two characters are uppercase


I'm trying to filter a table using PySpark so that I keep only the rows where the first two characters of the values in one of the columns are uppercase letters, such as 'UTrecht', 'NEw York', etc.

This is what I've tried, but it utterly failed:

df_filtered = df.filter(F.col("column_name").isUpper())

I've also tried:

df_filtered = df.filter(str.isupper, "affiliation_city")

which resulted in the following error:

filter() takes 2 positional arguments but 3 were given

I'm a relative newbie to PySpark, so any advice would be appreciated.

Thanks in advance!


Solution

  • You can check whether the leading substring is uppercase using the substr and upper functions, as below.

    Take the first two characters with substr, compare them with the uppercase version of that substring, upper(F.col("test").substr(1, 2)), and filter:

    from pyspark.sql import functions as F

    # Keep rows where the first two characters equal their uppercase form
    df.filter(
        F.col("test").substr(1, 2) == F.upper(F.col("test").substr(1, 2))
    ).show()
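
    For example, here is a minimal end-to-end sketch of that filter (the column name test and the sample city values are assumptions for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy DataFrame; 'test' is an assumed column name matching the snippet above
    df = spark.createDataFrame(
        [("UTrecht",), ("NEw York",), ("Amsterdam",)], ["test"]
    )

    # Only 'UTrecht' and 'NEw York' survive: their first two characters are
    # unchanged by upper(), while 'Am' != 'AM' for 'Amsterdam'
    df.filter(
        F.col("test").substr(1, 2) == F.upper(F.col("test").substr(1, 2))
    ).show()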
    

    If you want more flexibility (though a UDF is not the most efficient option), you can use a udf that wraps Python's isupper() function and returns True/False:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    # Return True when the first two characters are uppercase; treat nulls as False
    upp = udf(lambda x: x[0:2].isupper() if x is not None else False, BooleanType())

    df.filter(upp('affiliation_city')).show()
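
    Note that the two approaches differ on values whose first two characters contain no letters: '12'.isupper() is False in Python, while the column comparison above keeps such rows because upper() leaves digits unchanged. If you specifically want two uppercase ASCII letters, a sketch of an alternative using the built-in rlike (a regex match instead of a UDF; the character class [A-Z] assumes ASCII letters) could look like:

    from pyspark.sql import functions as F

    # Keep rows whose value starts with two uppercase ASCII letters;
    # rlike matches anywhere in the string, so anchor the pattern with ^
    df.filter(F.col("affiliation_city").rlike("^[A-Z]{2}")).show()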