python regex tokenize text-parsing

Python tokenize sentence with optional key/val pairs


I'm trying to parse a line of text that contains a sentence, optionally followed by some key/val pairs on the same line. Not only are the key/value pairs optional, they are dynamic (the keys vary from line to line). I'm looking for a result something like this:

Input:

"There was a cow at home. home=mary cowname=betsy date=10-jan-2013"

Output:

Values = {'theSentence' : "There was a cow at home.",
          'home' : "mary",
          'cowname' : "betsy",
          'date' : "10-jan-2013"
         }

Input:

"Mike ordered a large hamburger. lastname=Smith store=burgerville"

Output:

Values = {'theSentence' : "Mike ordered a large hamburger.",
          'lastname' : "Smith",
          'store' : "burgerville"
         }

Input:

"Sam is nice."

Output:

Values = {'theSentence' : "Sam is nice."}

Thanks for any input/direction. I know the example sentences make this look like a homework problem, but I'm just a Python newbie. I suspect the solution involves a regex, but regexes aren't my strong suit.


Solution

  • I'd use re.sub:

    import re
    
    s = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
    
    d = {}
    
    def add(m):
        # Store the key/value pair, then return '' so the pair is
        # removed from the sentence.
        d[m.group(1)] = m.group(2)
        return ''
    
    s = re.sub(r'(\w+)=(\S+)', add, s)
    d['theSentence'] = s.strip()
    
    print(d)
    

    Here's a more compact version if you prefer:

    d = {}
    d['theSentence'] = re.sub(r'(\w+)=(\S+)',
        lambda m: d.setdefault(m.group(1), m.group(2)) and '',
        s).strip()
    

    Or, maybe, findall is a better option:

    rx = r'(\w+)=(\S+)|(\S.+?)(?=\w+=|$)'
    d = {
        a or 'theSentence': (b or c).strip()
        for a, b, c in re.findall(rx, s)
    }
    print(d)
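
For reuse, the re.sub approach above could be wrapped in a function so that each call builds its own dict instead of sharing a module-level one. This is only a minimal sketch: the names `parse_line` and `PAIR_RE` are made up for illustration and do not appear in the answer, and it assumes that every `key=value` token on the line is a pair rather than part of the sentence.

```python
import re

# Hypothetical helper names (not from the original answer).
# A pair is a word-character key, '=', then a run of non-whitespace.
PAIR_RE = re.compile(r'(\w+)=(\S+)')


def parse_line(line):
    """Split a line into its sentence and any key=value pairs."""
    values = {}

    def collect(match):
        # Record the pair, then return '' so it is cut out of the text.
        values[match.group(1)] = match.group(2)
        return ''

    sentence = PAIR_RE.sub(collect, line)
    # Whatever remains after removing the pairs is treated as the sentence.
    values['theSentence'] = sentence.strip()
    return values


for line in [
    "There was a cow at home. home=mary cowname=betsy date=10-jan-2013",
    "Mike ordered a large hamburger. lastname=Smith store=burgerville",
    "Sam is nice.",
]:
    print(parse_line(line))
```

Note that a sentence which itself contains something like `x=1` would have that token pulled out as a pair, and a key literally named `theSentence` would be overwritten by the sentence text.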