I'm using PySpark and the spaCy package, and I have a data set of tokens. I'm trying to filter out any rows whose token contains a symbol or other non-alphanumeric character:
the
house
#
was
in
the)
400s
w-ow
$crazy
It should only return:
the
house
was
in
400s
I tried using something like
F.regexp_extract(F.col('TOKEN'), '[^[A-Za-z0-9] ]', 0)
but I want to search the entire token, not just index 0. I thought about using a contains() statement, but it seems like that would require a ton of different or statements to capture all the different symbols I want to exclude.
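For reference, here is roughly how the sample data could be set up to test against (the column name TOKEN comes from my attempt above; the SparkSession setup is just for illustration):

from pyspark.sql import SparkSession

# Illustrative setup: load the sample tokens into a one-column DataFrame
spark = SparkSession.builder.getOrCreate()
tokens = ["the", "house", "#", "was", "in", "the)", "400s", "w-ow", "$crazy"]
df = spark.createDataFrame([(t,) for t in tokens], ["TOKEN"])
df.show()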
Check this out: you can use the rlike function and negate it with ~ in the filter. rlike returns true when the pattern matches anywhere in the string, so negating a match on [^0-9A-Za-z] keeps only the tokens that contain no non-alphanumeric characters.
from pyspark.sql import functions as F
# INPUT DF
# +------+
# |  text|
# +------+
# |   the|
# | house|
# |     #|
# |   was|
# |    in|
# |  the)|
# |  400s|
# |  w-ow|
# |$crazy|
# +------+
df.filter(~F.col("text").rlike("[^0-9A-Za-z]")).show()
# OUTPUT DF
# +-----+
# | text|
# +-----+
# |  the|
# |house|
# |  was|
# |   in|
# | 400s|
# +-----+
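Equivalently, if you'd rather state what to keep instead of what to drop, you can anchor the pattern so a token only passes when it is entirely alphanumeric. A quick sketch, assuming the same text column as above:

# Keep rows whose token consists only of letters and digits
df.filter(F.col("text").rlike("^[0-9A-Za-z]+$")).show()

Since rlike matches the pattern anywhere in the string, the ^ and $ anchors are what turn this into a whole-token check.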