For a given string like:
"Today is a bright sunny day in New York"
I want the resulting list to be:
['Today','is','a','bright','sunny','day','in','New York']
Another example:
"This is a hello world program"
The list should be:
['This', 'is', 'a', 'hello world', 'program']
For every given string S, there is an entity E whose words need to be kept together. In the first example the entity E is "New", "York", and in the second example it is "hello", "world".
I have tried to do this with a regex, but I have not managed to split on spaces while keeping the words of an entity together.
Example:
regex = "(navy blue)|[a-zA-Z0-9]*"
match = re.findall(regex, "the sky looks navy blue.",re.IGNORECASE)
print match
Output:
['', '', '', '', '', '', 'navy blue', '', '']
Use re.findall instead of split, and supply the entity in the alternation before the character class that represents the strings to extract:
>>> s = "Today is a bright sunny day in New York"
>>> re.findall(r'New York|\w+', s)
['Today', 'is', 'a', 'bright', 'sunny', 'day', 'in', 'New York']
>>> s = "This is a hello world program"
>>> re.findall(r'hello world|\w+', s)
['This', 'is', 'a', 'hello world', 'program']
Change \w to an appropriate character class, for example [a-zA-Z].
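As a small illustration of how the choice of character class changes the result, here is a sketch using a made-up input string that contains a digit (the string is an assumption for the example, not taken from the question):

```python
import re

s = "Day 2 in New York"  # hypothetical input containing a digit

# \w matches letters, digits and underscore, so "2" is kept as a token
with_w = re.findall(r'New York|\w+', s)

# [a-zA-Z] matches ASCII letters only, so "2" is dropped
letters_only = re.findall(r'New York|[a-zA-Z]+', s)

print(with_w)        # ['Day', '2', 'in', 'New York']
print(letters_only)  # ['Day', 'in', 'New York']
```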
For the additional sample added to the question:
>>> regex = r"navy blue|[a-z\d]+"
>>> re.findall(regex, "the sky looks navy blue.", re.IGNORECASE)
['the', 'sky', 'looks', 'navy blue']
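Use r strings to construct regex patterns as a good practice.
Use + instead of * so that at least one character has to be matched.
Since re.IGNORECASE is specified, either a-z or A-Z is enough in the character class. You can also use re.I as a shortcut.
\d is a shortcut for [0-9] (for ASCII input; in Python 3, \d in a str pattern also matches other Unicode digits unless re.ASCII is set).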
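If the entities are only known at run time, the approach above could be wrapped in a small helper that builds the alternation for you. This is a minimal sketch, not part of the original answer: the function name split_keeping_entities and its word_class parameter are hypothetical, and it assumes that entities should be matched case-insensitively and that word_class contains no capturing groups (a capturing group would change what re.findall returns).

```python
import re

def split_keeping_entities(text, entities, word_class=r"\w+"):
    """Split text into words, keeping each entity as a single token."""
    # Escape each entity so characters such as '.' or '+' match literally,
    # and try longer entities first so "New York City" wins over "New York".
    escaped = [re.escape(e) for e in sorted(entities, key=len, reverse=True)]
    # Entities come before the generic word class in the alternation.
    pattern = "|".join(escaped + [word_class])
    return re.findall(pattern, text, flags=re.IGNORECASE)

print(split_keeping_entities("Today is a bright sunny day in New York", ["New York"]))
print(split_keeping_entities("the sky looks navy blue.", ["navy blue"]))
```

Note that this sketch does not add word boundaries, so an entity can also match at the start of a longer word (for example "New York" inside "New Yorker"); wrapping each escaped entity in \b would be one way to tighten that if needed.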