
Regex in Python - Parse complex data groups


How can I parse the following data using regular expressions?

Test data 1
  Measurement 1     X            :      -0.100  Y :      2.300
  Something   1                  :       0.00
  Stuff       1                  :       0.00
  Needed      1     X            :      -0.800  Y :      5.300

Test data 2
  Measurement 1     X            :      -0.600  Y :      4.300
  Something   1                  :       0.30
  Stuff       1                  :      -0.20
  Extra       1                  :      -0.800

I want to extract the Measurement 1 data (X and Y values) and the Needed 1 data (X and Y values) from Test data 1.

I also want to extract the Measurement 1 data (X and Y values) and the Extra 1 data from Test data 2.

The measurements have the same names but appear under different table names.

import re

for line in data:
    if "Test data 1" in line:
        match = re.match(r"   Measurement  1   X          :     ([\-\d\.]+)    Y :       ([\-\d\.]+)\s*$", line)
        if match:
            X_table1 = match.group(1)
            Y_table1 = match.group(2)
    if "Test data 2" in line:
        match = re.match(r"   Measurement  1   X          :     ([\-\d\.]+)    Y :       ([\-\d\.]+)\s*$", line)
        if match:
            X_table2 = match.group(1)
            Y_table2 = match.group(2)

Thank you for any help


Solution

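  • You're processing your data one line at a time, but the X and Y values are on different lines than the segment headers, so a line that contains "Test data 1" can never also match the Measurement pattern. Your code therefore needs to remember which segment it is currently in (i.e. act as a simple stateful parser). You can also reuse one generic pattern to extract the X and Y values, using \s+ instead of a fixed number of spaces so the match does not depend on the exact column widths.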

    import re

    # flags that record which table the loop is currently inside
    data1 = data2 = False
    xy_pattern = r'X\s+:\s+([\-\d\.]+)\s+Y\s+:\s+([\-\d\.]+)'
    
    for line in data:
        # set state
        if "Test data 1" in line:
            data1 = True
            data2 = False
            continue
        elif "Test data 2" in line:
            data1 = False
            data2 = True
            continue
    
        # extract data
        if data1 and 'Measurement' in line:
            matches = re.findall(xy_pattern, line)
            if matches:
                X_table1, Y_table1 = matches[0]
        elif data2 and 'Measurement' in line:
            matches = re.findall(xy_pattern, line)
            if matches:
                X_table2, Y_table2 = matches[0]
    

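    In the same way, you can check for the Needed line, which has the same X/Y layout. The Extra line holds a single value with no X or Y label, so xy_pattern will not match it and it needs its own pattern for one number after the colon. Note also that the matches are still strings, so you may want to convert them with float(), depending on what you do with them.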
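
The points above (tracking the current table, handling both X/Y rows and single-value rows, and converting to floats) can be combined into one pass. The following is only a sketch under assumptions: the helper name parse_tables, the pattern names, and the nested-dict result layout are hypothetical choices that are not part of the original answer, and the sample text simply repeats the data from the question.

```python
import re

# All names below are hypothetical, chosen for illustration only.
NUM = r'(-?\d+(?:\.\d+)?)'                      # a signed decimal number
HEADER = re.compile(r'^\s*(Test data \d+)\s*$')  # table header line
XY_ROW = re.compile(                             # "<name> <idx> X : <x> Y : <y>"
    r'^\s*(\w+)\s+(\d+)\s+X\s*:\s*' + NUM + r'\s+Y\s*:\s*' + NUM + r'\s*$')
SINGLE_ROW = re.compile(                         # "<name> <idx> : <value>"
    r'^\s*(\w+)\s+(\d+)\s*:\s*' + NUM + r'\s*$')


def parse_tables(lines):
    """Return {table name: {row name: float or (x, y) tuple of floats}}."""
    tables = {}
    current = None  # dict for the table we are currently inside
    for line in lines:
        header = HEADER.match(line)
        if header:
            current = tables.setdefault(header.group(1), {})
            continue
        if current is None:
            continue  # ignore anything before the first header
        xy = XY_ROW.match(line)
        if xy:
            name, idx, x, y = xy.groups()
            current[f"{name} {idx}"] = (float(x), float(y))
            continue
        single = SINGLE_ROW.match(line)
        if single:
            name, idx, value = single.groups()
            current[f"{name} {idx}"] = float(value)
    return tables


sample = """\
Test data 1
  Measurement 1     X            :      -0.100  Y :      2.300
  Something   1                  :       0.00
  Stuff       1                  :       0.00
  Needed      1     X            :      -0.800  Y :      5.300

Test data 2
  Measurement 1     X            :      -0.600  Y :      4.300
  Something   1                  :       0.30
  Stuff       1                  :      -0.20
  Extra       1                  :      -0.800
"""

tables = parse_tables(sample.splitlines())
print(tables["Test data 1"]["Needed 1"])   # (-0.8, 5.3)
print(tables["Test data 2"]["Extra 1"])    # -0.8
```

Keying the result by table name means rows with the same name in different tables no longer need separate variables such as X_table1 and X_table2; whether a dict fits better than individual variables depends on how the values are used afterwards.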