python regex tokenize lexical-analysis

How to tokenize the sample string using Regular Expression in Python?


I am new to regular expressions. Besides the pattern that matches the following string, please also point me to references and/or sample web sites.

The data string

1.  First1 Last1 - 20 (Long Description) 
2.  First2 Last2 - 40 (Another Description)

I want to be able to extract tuples {First1,Last1,20} and {First2,Last2,40} from the above string.


Solution

  • This one seems OK: http://docs.python.org/howto/regex.html#regex-howto Skim it and try some of the examples. Regexps are a little tricky (basically a small programming language of their own) and take some time to learn, but they are very useful to know. Just experiment and take one step at a time.

    (Yes, I could just give you the answer, but: give a man a fish, teach a man to fish.)

    ...

    As requested, here is a solution that does not use split(): iterate over the lines and check each one:

    import re

    p = re.compile(r'\d+\.\s+(\w+)\s+(\w+)\s+-\s+(\d+)')
    m = p.match(the_line)  # None if the line does not conform
    # m.group(0) is the entire matched text
    # m.group(1) will be the first word
    # m.group(2) the second word
    # m.group(3) will be the first number after the last word.
    
    The regexp is: <some digits><a dot>
    <some whitespace><word characters (letters, digits, underscore), captured as group 1>
    <some whitespace><word characters, captured as group 2>
    <some whitespace><a '-'><some whitespace><digits, captured as group 3>
    

    It's a little strict, but that way you'll catch non-conforming lines.
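
    For reference, here is a minimal sketch of the line-by-line approach applied to the two sample lines from the question. The names DATA, LINE_RE and extract_records are made up for illustration, and converting the number to int is an assumption about what you want in the tuple. Lines that do not match are collected separately instead of being silently dropped.

```python
import re

# Sample data copied from the question.
DATA = """\
1.  First1 Last1 - 20 (Long Description)
2.  First2 Last2 - 40 (Another Description)
"""

# <digits><dot><ws>(word)<ws>(word)<ws>-<ws>(digits)
LINE_RE = re.compile(r'\d+\.\s+(\w+)\s+(\w+)\s+-\s+(\d+)')


def extract_records(text):
    """Return (records, rejected) for the given multi-line text.

    records  -- list of (first, last, number) tuples, number as int
    rejected -- list of non-blank lines that did not match the pattern
    """
    records = []
    rejected = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        m = LINE_RE.match(line)  # match() anchors at the start of the line
        if m is None:
            rejected.append(line)
            continue
        first, last, number = m.groups()  # groups 1, 2 and 3
        records.append((first, last, int(number)))
    return records, rejected


records, rejected = extract_records(DATA)
print(records)   # [('First1', 'Last1', 20), ('First2', 'Last2', 40)]
print(rejected)  # []
```

    Because match() only anchors at the start, the trailing "(Long Description)" part is simply ignored. If you also want to reject lines with unexpected trailing text, extend the pattern rather than loosening it.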