
Filter Any Non Alpha Numeric In PySpark


I'm using PySpark and the spaCy package, and I have a data set of tokens. I'm trying to filter out any rows whose token contains a symbol or non-alphanumeric character.

the
house
#
was
in
the)
400s
w-ow
$crazy

The filter should only return

the
house
was
in
400s

I tried using something like F.regexp_extract(F.col('TOKEN'), '[^[A-Za-z0-9] ]', 0), but I want to search the entire token, not just index 0. I thought about using a contains() statement, but it seems like that would require a ton of different "or" conditions to capture all the different symbols I want to exclude.
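For reference, the third argument to regexp_extract is the capture-group index, not a position in the token: the pattern is already matched against the whole string. Python's re module (used here only as a stand-in; Spark's rlike uses Java regex, but this character class behaves the same) shows that a search-style match finds a symbol anywhere in the token:

```python
import re

# A search-style match scans the whole string -- no index argument
# is needed to look beyond the first character.
pattern = re.compile(r"[^0-9A-Za-z]")
print(bool(pattern.search("w-ow")))    # symbol mid-token is found
print(bool(pattern.search("$crazy")))  # symbol at the start is found
print(bool(pattern.search("house")))   # clean token: no match
```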


Solution

  • You can use the rlike function with negation (~) in the filter. rlike returns true if the pattern matches anywhere in the string, so negating a match on [^0-9A-Za-z] keeps only the fully alphanumeric tokens.

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.getOrCreate()

        # INPUT DF
        df = spark.createDataFrame(
            [("the",), ("house",), ("#",), ("was",), ("in",),
             ("the)",), ("400s",), ("w-ow",), ("$crazy",)],
            ["text"],
        )
        df.show()
        # +------+
        # |  text|
        # +------+
        # |   the|
        # | house|
        # |     #|
        # |   was|
        # |    in|
        # |  the)|
        # |  400s|
        # |  w-ow|
        # |$crazy|
        # +------+

        # keep only rows whose text contains no non-alphanumeric character
        df.filter(~F.col("text").rlike("[^0-9A-Za-z]")).show()

        # OUTPUT DF
        # +-----+
        # | text|
        # +-----+
        # |  the|
        # |house|
        # |  was|
        # |   in|
        # | 400s|
        # +-----+
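As a quick sanity check outside Spark, the same negated character class can be applied with Python's re module (an emulation of the filter, not the Spark code path) to confirm which of the sample tokens survive:

```python
import re

# Emulate df.filter(~rlike("[^0-9A-Za-z]")) on the sample tokens:
# keep a token only if it contains no non-alphanumeric character.
tokens = ["the", "house", "#", "was", "in", "the)", "400s", "w-ow", "$crazy"]
non_alnum = re.compile(r"[^0-9A-Za-z]")
kept = [t for t in tokens if not non_alnum.search(t)]
print(kept)  # ['the', 'house', 'was', 'in', '400s']
```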