python regex python-2.7 tokenize

Python Regexp for extracting tags and words


I have the following string:

str1 = "I/TAG1 like/TAG2 red/TAG3 apples/TAG3 ./TAG4"

And I have two empty lists in Python:

tokens = []
tags = []

My desired output would be:

tokens = ['I', 'like', 'red', 'apples', '.']
tags = ['TAG1', 'TAG2', 'TAG3', 'TAG3', 'TAG4']

I am trying to use a regexp like this one:

r"\w*\/"

But that extracts the words together with the slash, i.e. I/ instead of I. How can I get the desired output, at least for tokens (everything before the /)?


Solution

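  • You can use re.findall with two capture groups, one on each side of the slash. With more than one group, findall returns a list of tuples:

    >>> import re
    >>> re.findall(r'([\w.]+)/([\w.]+)', str1)
    [('I', 'TAG1'), ('like', 'TAG2'), ('red', 'TAG3'), ('apples', 'TAG3'), ('.', 'TAG4')]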

    To fill the two lists from the question:

    >>> tokens = []
    >>> tags = []
    >>> for m in re.findall(r'([\w.]+)/([\w.]+)', str1):
    ...     tokens.append(m[0])  # text before the slash
    ...     tags.append(m[1])    # text after the slash
    ...
    >>> print(tokens)
    ['I', 'like', 'red', 'apples', '.']
    >>> print(tags)
    ['TAG1', 'TAG2', 'TAG3', 'TAG3', 'TAG4']
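
    Two possible variations on the loop above, as a rough sketch rather than the only way to do it. The first transposes the findall pairs with zip(*pairs) instead of appending in a loop. The second avoids regular expressions and splits each item on its last slash. Both assume the string is non-empty and that every whitespace-separated item contains a slash; the names pairs, split_pairs, tokens2 and tags2 are made up for illustration.

```python
import re

str1 = "I/TAG1 like/TAG2 red/TAG3 apples/TAG3 ./TAG4"

# Regex route: findall returns (token, tag) pairs; zip(*pairs) transposes
# them into one column of tokens and one column of tags.
pairs = re.findall(r'([\w.]+)/([\w.]+)', str1)
tokens, tags = [list(column) for column in zip(*pairs)]

# Regex-free route: split on whitespace, then on the last slash of each item.
# rsplit with maxsplit=1 keeps any earlier slashes inside the token.
split_pairs = [item.rsplit('/', 1) for item in str1.split()]
tokens2 = [token for token, tag in split_pairs]
tags2 = [tag for token, tag in split_pairs]

print(tokens)
print(tags)
```

    The rsplit version may be preferable when tokens can contain characters that [\w.] does not match, such as commas or hyphens, since it only relies on whitespace and the final slash.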