Search code examples
regexgoogle-bigquerynon-alphanumeric

How can I select all records with non-alphanumeric and remove them?


I want to select the records containing non-alphanumeric and remove those symbols from strings. The result I expecting is strings with only numbers and letters.

I'm not really familiar with regular expression and sometime it's really confusing. The code below is from answers to similar questions. But it also returns records having only letters and space. I also tried to use /s in case some spaces are not spaces but tabs instead. But I got the same result.

Also, I want to remove all symbols, characters excepting letters, numbers and spaces. I found a function named removesymbols from google could reference. But it seems this function does not exist at all. The website introduces removesymbols is https://cloud.google.com/dataprep/docs/html/REMOVESYMBOLS-Function_57344727. How can I remove all symbols? I don't want to use replace because there are a lot of symbols and I don't know all kinds of non-alphanumeric they have.

-- the code here only shows I want to select all records with non-alphanumeric
SELECT EMPLOYER
FROM fec.work
WHERE EMPLOYER NOT LIKE '[^a-zA-Z0-9/s]+'
GROUP BY 1;

Solution

  • I suggest using REGEXP_REPLACE for select, to remove the characters, and using REGEXP_CONTAINS to get only the one you want.

    SELECT REGEXP_REPLACE(EMPLOYER, r'[^a-zA-Z\d\s]', '') 
    FROM fec.work
    WHERE REGEXP_CONTAINS(EMPLOYER, r'[^a-zA-Z\d\s]')
    

    You say you don't want to use replace because you don't know how many alphanumerical there is. But instead of listing all non-alphanumerical, why not use ^ to get all but alphanumerical ?

    EDIT :

    To complete with what Mikhail answered, you have multiple choices for your regex :

    '[^a-zA-Z\\d\\s]'  // Basic regex
    r'[^a-zA-Z\d\s]'   // Uses r to avoid escaping
    r'[^\w\s]'         // \w = [a-zA-Z0-9_] (! underscore as alphanumerical !)
    

    If you don't consider underscores to be alphanumerical, you should not use \w