
Filter Any Non Alpha Numeric In PySpark


I'm using PySpark and the spaCy package, and I have a data set of tokens. I'm trying to filter out any rows whose token contains a symbol or non-alphanumeric character.

the
house
#
was
in
the)
400s
w-ow
$crazy

The filter should only return

the
house
was
in
400s

I tried using something like F.regexp_extract(F.col('TOKEN'), '[^[A-Za-z0-9] ]', 0), but I want to search the entire token, not just index 0. I thought about using a contains() statement, but it seems like that would require a ton of different "or" conditions to capture all the different symbols I want to exclude.
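For reference, the third argument to regexp_extract is the capture-group index, not a position in the token: the pattern is already matched against the whole string. Python's re module (used here only as a stand-in; Spark's rlike uses Java regex, but this character class behaves the same) shows that a search-style match finds a symbol anywhere in the token:

```python
import re

# A search-style match scans the whole string -- no index argument
# is needed to look beyond the first character.
pattern = re.compile(r"[^0-9A-Za-z]")
print(bool(pattern.search("w-ow")))    # symbol mid-token is found
print(bool(pattern.search("$crazy")))  # symbol at the start is found
print(bool(pattern.search("house")))   # clean token: no match
```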


Solution

  • You can use the rlike function with negation (~) in the filter. rlike returns true if the pattern matches anywhere in the string, so negating a match on [^0-9A-Za-z] keeps only the fully alphanumeric tokens.

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.getOrCreate()

        # INPUT DF
        df = spark.createDataFrame(
            [("the",), ("house",), ("#",), ("was",), ("in",),
             ("the)",), ("400s",), ("w-ow",), ("$crazy",)],
            ["text"],
        )
        df.show()
        # +------+
        # |  text|
        # +------+
        # |   the|
        # | house|
        # |     #|
        # |   was|
        # |    in|
        # |  the)|
        # |  400s|
        # |  w-ow|
        # |$crazy|
        # +------+

        # keep only rows whose text contains no non-alphanumeric character
        df.filter(~F.col("text").rlike("[^0-9A-Za-z]")).show()

        # OUTPUT DF
        # +-----+
        # | text|
        # +-----+
        # |  the|
        # |house|
        # |  was|
        # |   in|
        # | 400s|
        # +-----+
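As a quick sanity check outside Spark, the same negated character class can be applied with Python's re module (an emulation of the filter, not the Spark code path) to confirm which of the sample tokens survive:

```python
import re

# Emulate df.filter(~rlike("[^0-9A-Za-z]")) on the sample tokens:
# keep a token only if it contains no non-alphanumeric character.
tokens = ["the", "house", "#", "was", "in", "the)", "400s", "w-ow", "$crazy"]
non_alnum = re.compile(r"[^0-9A-Za-z]")
kept = [t for t in tokens if not non_alnum.search(t)]
print(kept)  # ['the', 'house', 'was', 'in', '400s']
```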