Tags: python, apache-spark, pyspark, filtering, uppercase

Filter on column values whose first two characters are uppercase


I'm trying to filter a table using PySpark so that I keep only the rows where the first two characters of the values in one of the columns are uppercase letters, such as 'UTrecht', 'NEw York', etc.

This is what I've tried, but it utterly failed:

df_filtered = df.filter(F.col("column_name").isUpper())

I've also tried:

df_filtered = df.filter(str.isupper, "affiliation_city")

which resulted in the following error:

filter() takes 2 positional arguments but 3 were given

I'm a relative newbie to PySpark, so any advice would be appreciated.

Thanks in advance!


Solution

  • You can check whether the leading substring is uppercase using the substr and upper functions, as below.

    Take the first two characters with substr, compare them with the uppercase version of that substring, upper(F.col("test").substr(1, 2)), and filter:

    from pyspark.sql import functions as F

    # Keep rows where the first two characters equal their uppercase form
    df.filter(
        F.col("test").substr(1, 2) == F.upper(F.col("test").substr(1, 2))
    ).show()
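
    For example, here is a minimal end-to-end sketch of that filter (the column name test and the sample city values are assumptions for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy DataFrame; 'test' is an assumed column name matching the snippet above
    df = spark.createDataFrame(
        [("UTrecht",), ("NEw York",), ("Amsterdam",)], ["test"]
    )

    # Only 'UTrecht' and 'NEw York' survive: their first two characters are
    # unchanged by upper(), while 'Am' != 'AM' for 'Amsterdam'
    df.filter(
        F.col("test").substr(1, 2) == F.upper(F.col("test").substr(1, 2))
    ).show()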
    

    If you want more flexibility (though a UDF is not the most efficient option), you can use a udf that wraps Python's isupper() function and returns True/False:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    # Return True when the first two characters are uppercase; treat nulls as False
    upp = udf(lambda x: x[0:2].isupper() if x is not None else False, BooleanType())

    df.filter(upp('affiliation_city')).show()
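
    Note that the two approaches differ on values whose first two characters contain no letters: '12'.isupper() is False in Python, while the column comparison above keeps such rows because upper() leaves digits unchanged. If you specifically want two uppercase ASCII letters, a sketch of an alternative using the built-in rlike (a regex match instead of a UDF; the character class [A-Z] assumes ASCII letters) could look like:

    from pyspark.sql import functions as F

    # Keep rows whose value starts with two uppercase ASCII letters;
    # rlike matches anywhere in the string, so anchor the pattern with ^
    df.filter(F.col("affiliation_city").rlike("^[A-Z]{2}")).show()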