How to split a string into a list and combine two known tokens into one in Python?

For a given string like:

"Today is a bright sunny day in New York"

I want the resulting list to be:

['Today','is','a','bright','sunny','day','in','New York']

Another example:

"This is a hello world program"

The list should be: ['This', 'is', 'a', 'hello world', 'program']

For every given string S, we have a set of entities E that need to be kept together. In the first example the entity consists of "New" and "York"; in the second it consists of "hello" and "world".

I have tried to do this with a regex, but I have not managed to split on spaces while keeping the two words of an entity merged.

Example:

regex = "(navy blue)|[a-zA-Z0-9]*"
match = re.findall(regex, "the sky looks navy blue.", re.IGNORECASE)
print match

Output: ['', '', '', '', '', '', 'navy blue', '', '']

Solution

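Use re.findall instead of split, and put the entity first in an alternation, before the character class that matches ordinary words. Alternatives are tried left to right, so the multi-word entity is matched before \w+ can consume its first word: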

>>> s = "Today is a bright sunny day in New York"
>>> re.findall(r'New York|\w+', s)
['Today', 'is', 'a', 'bright', 'sunny', 'day', 'in', 'New York']

>>> s = "This is a hello world program"
>>> re.findall(r'hello world|\w+', s)
['This', 'is', 'a', 'hello world', 'program']

Change \w to whatever character class suits your data, for example [a-zA-Z].
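If there are several entities, the same alternation idea can be built from a list. The following is only a sketch: the function name tokenize and its entities parameter are made up for illustration, and it assumes every entity starts and ends with a word character so that \b behaves as expected.

```python
import re

def tokenize(text, entities):
    """Split text into words, keeping each known entity as a single token.

    `tokenize` and `entities` are illustrative names, not part of the
    original answer.
    """
    # Try longer entities first so "New York City" wins over "New York".
    ordered = sorted(entities, key=len, reverse=True)
    # re.escape guards against regex metacharacters inside an entity;
    # \b stops "New York" from matching inside "New Yorker".
    parts = [r"\b" + re.escape(entity) + r"\b" for entity in ordered]
    # Fall back to plain words when no entity matches at this position.
    parts.append(r"\w+")
    return re.findall("|".join(parts), text)

print(tokenize("Today is a bright sunny day in New York", ["New York"]))
print(tokenize("This is a hello world program", ["hello world"]))
```

Sorting by length matters because the regex engine takes the first alternative that matches, not the longest one.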

For the additional sample added to the question:

>>> regex = r"navy blue|[a-z\d]+"
>>> re.findall(regex, "the sky looks navy blue.", re.IGNORECASE)
['the', 'sky', 'looks', 'navy blue']

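Use raw strings (r'...') to construct regex patterns, as a good practice.
Grouping is not needed here. With a capturing group, re.findall returns the group's contents instead of the whole match, which is why the words other than "navy blue" show up as empty strings in the question's output.
Use + instead of * so that at least one character has to be matched; * also matches the empty string.
Since re.IGNORECASE is specified, either a-z or A-Z is enough in the character class. re.I can also be used as a shortcut.
\d is a shortcut for [0-9] (for str patterns in Python 3 it also matches other Unicode digits unless re.ASCII is set).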
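
To see the effect of the grouping and the quantifier separately, here is a small comparison on the question's own sentence. The variable names are arbitrary. The exact number of empty strings produced by * depends on the Python version, so the sketch only checks that some are present.

```python
import re

text = "the sky looks navy blue."

# Capturing group: findall returns the group, which is empty for plain words.
with_group = re.findall(r"(navy blue)|[a-z\d]+", text, re.I)
print(with_group)      # ['', '', '', 'navy blue']

# No group: findall returns the whole match.
without_group = re.findall(r"navy blue|[a-z\d]+", text, re.I)
print(without_group)   # ['the', 'sky', 'looks', 'navy blue']

# A non-capturing group (?:...) behaves like no group at all.
non_capturing = re.findall(r"(?:navy blue)|[a-z\d]+", text, re.I)
print(non_capturing == without_group)   # True

# With * the class can match zero characters, so empty strings appear
# wherever a space or the period is reached.
with_star = re.findall(r"navy blue|[a-z\d]*", text, re.I)
print("" in with_star)   # True
```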