python, regex, alphanumeric

Regex to extract ONLY alphanumeric words


I am looking for a regex that extracts the words that contain ONLY alphanumeric characters:

string = 'This is a $dollar sign !!'
matches = re.findall(regex, string)
matches = ['This', 'is', 'sign']

This can be done by tokenizing the string and evaluating each token individually with the following regex:

^[a-zA-Z0-9]+$
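
For reference, that tokenize-then-filter baseline could look like the sketch below. The helper name alnum_tokens is made up for illustration, and tokens are assumed to be separated by whitespace.

```python
import re

# Matches a token made up entirely of ASCII letters and digits.
TOKEN_RE = re.compile(r'^[a-zA-Z0-9]+$')

def alnum_tokens(text):
    """Split on whitespace, then keep only the fully alphanumeric tokens."""
    return [tok for tok in text.split() if TOKEN_RE.match(tok)]

print(alnum_tokens('This is a $dollar sign !!'))
```

Every token is materialized by split() before it is tested, which is the cost the single-pass regex is meant to avoid.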

Due to performance issues, I want to be able to extract the alphanumeric tokens without tokenizing the whole string. The closest I got was

regex = r'\b[a-zA-Z0-9]+\b'

but it still extracts alphanumeric substrings from tokens that also contain other characters:

string = 'This is a $dollar sign !!'
matches = re.findall(regex, string)
matches = ['This', 'is', 'a', 'dollar', 'sign']

Is there a regex able to pull this off? I've tried different things but can't come up with a solution.


Solution

  • A word boundary \b also matches between $ and d, which is why dollar gets extracted. Instead of word boundaries, use a lookbehind and a lookahead for spaces (or the beginning/end of the string):

    (?:^|(?<= ))[a-zA-Z0-9]+(?= |$)
    

    https://regex101.com/r/TZ7q1c/1

    Note that "a" is a standalone alphanumeric word, so it's included too.

    ['This', 'is', 'a', 'sign']
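
A minimal sketch of the pattern above in Python, to make the behavior easy to check. The second pattern is an assumed variant that is not part of the answer: it uses negative lookarounds on \S so that any whitespace (tabs, newlines) counts as a delimiter, not just a literal space. The constant names are illustrative.

```python
import re

# The answer's pattern: a run of alphanumerics preceded by start-of-string or a
# space, and followed by a space or end-of-string.
SPACE_DELIMITED = re.compile(r'(?:^|(?<= ))[a-zA-Z0-9]+(?= |$)')

# Assumed variant (not from the answer): the run must not be adjacent to any
# non-whitespace character, so tabs and newlines also act as delimiters.
WHITESPACE_DELIMITED = re.compile(r'(?<!\S)[a-zA-Z0-9]+(?!\S)')

text = 'This is a $dollar sign !!'
print(SPACE_DELIMITED.findall(text))       # ['This', 'is', 'a', 'sign']
print(WHITESPACE_DELIMITED.findall(text))  # ['This', 'is', 'a', 'sign']

# With a tab in the input, only the whitespace-based variant sees "tab" and
# "separated" as standalone words.
print(WHITESPACE_DELIMITED.findall('tab\tseparated x-ray ok'))  # ['tab', 'separated', 'ok']
```

Both patterns scan the string once with re.findall and never build an intermediate token list. Which one fits depends on whether the input can contain whitespace other than plain spaces.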