
How to split a string into a list and combine two known tokens into one in Python?


For a given string like:

"Today is a bright sunny day in New York"

I want the resulting list to be:

['Today','is','a','bright','sunny','day','in','New York']

Another example:

"This is a hello world program"

The list should be: ['This', 'is', 'a', 'hello world', 'program']

For every given string S, there are entities E that need to be kept together. In the first example the entity E is "New", "York", and in the second example it is "hello", "world".

I have tried to do this with a regex, but I have not managed to split on spaces while also merging the two entity tokens.

Example:

    regex = "(navy blue)|[a-zA-Z0-9]*"
    match = re.findall(regex, "the sky looks navy blue.", re.IGNORECASE)
    print match

Output: ['', '', '', '', '', '', 'navy blue', '', '']


Solution

  • Use re.findall instead of split, and put the entity in the alternation before the character class that matches ordinary words. Alternatives are tried left to right, so the entity is matched as a whole wherever it occurs.

    >>> s = "Today is a bright sunny day in New York"
    >>> re.findall(r'New York|\w+', s)
    ['Today', 'is', 'a', 'bright', 'sunny', 'day', 'in', 'New York']
    
    >>> s = "This is a hello world program"
    >>> re.findall(r'hello world|\w+', s)
    ['This', 'is', 'a', 'hello world', 'program']
    

    Change \w to a more appropriate character class if needed, for example [a-zA-Z].
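    When there are several entities, the same idea can be wrapped in a small helper. This is only a sketch: the name tokenize and its parameters are hypothetical, and it assumes every entity starts and ends with a word character (so that \b works) and that matching is case-sensitive.

```python
import re

def tokenize(text, entities):
    """Split text into words, keeping each known entity as one token.

    Hypothetical helper: assumes entities begin and end with word characters.
    """
    # Try longer entities first so "New York City" would win over "New York".
    ordered = sorted(entities, key=len, reverse=True)
    # re.escape guards against regex metacharacters inside an entity.
    alternation = "|".join(re.escape(e) for e in ordered)
    if alternation:
        # Entities come before \w+ because alternatives are tried left to right.
        pattern = rf"\b(?:{alternation})\b|\w+"
    else:
        pattern = r"\w+"
    return re.findall(pattern, text)

print(tokenize("Today is a bright sunny day in New York", ["New York"]))
print(tokenize("This is a hello world program", ["hello world"]))
```

    The non-capturing group (?:...) is used so that re.findall still returns the whole match rather than a group's content.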


    For the additional sample added to the question:

    >>> regex = r"navy blue|[a-z\d]+"
    >>> re.findall(regex, "the sky looks navy blue.", re.IGNORECASE)
    ['the', 'sky', 'looks', 'navy blue']
    
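    • Use raw strings (r'...') to construct regex patterns as a good practice.
    • Grouping is not needed here. When the pattern contains a capturing group, re.findall returns the group's content instead of the whole match, which is why the original attempt produced empty strings for the other words.
    • Use + instead of * so that at least one character has to be matched.
    • Since re.IGNORECASE is specified, either a-z or A-Z is enough in the character class. re.I is a shorter alias.
    • \d is a shortcut for [0-9] (for Python 3 str patterns it also matches other Unicode decimal digits unless re.ASCII is set).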
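The effect of the capturing group and of * versus + can be checked directly. The snippet below is a small illustration written for Python 3; the variable names are made up for this example, and no exact output is claimed for the * variant beyond the presence of empty strings.

```python
import re

text = "the sky looks navy blue."

# With a capturing group, findall returns only what the group captured,
# so words matched by the other alternative come back as ''.
with_group = re.findall(r"(navy blue)|[a-z\d]+", text, re.I)

# Without the group, findall returns each whole match.
without_group = re.findall(r"navy blue|[a-z\d]+", text, re.I)

# With * the character class may match zero characters,
# so empty strings show up in the result.
with_star = re.findall(r"navy blue|[a-z\d]*", text, re.I)

print(with_group)     # ['', '', '', 'navy blue']
print(without_group)  # ['the', 'sky', 'looks', 'navy blue']
print("" in with_star)
```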