regex python-3.x dataframe text-extraction data-extraction

Extract Positive and Negative float values in raw text file using Python

I'm trying to extract certain values from a raw text file like this:

Number of zero columns: 4
Memory requirement - global matrix: 1571340 solver (totally): 1571340
P1127_VELOCITIES #001000  Step:    59  Iteration:     2  Time:    0.04055  0.0015347 
P2243_VELOCITIES #001000  Step:    59  Iteration:     2  Time:    0.04055  0.0017193 
P3387_VELOCITIES #001000  Step:    59  Iteration:     2  Time:    0.04055  0.0015347 
% of load in interval  Step:    59  Iteration:     2  Time:    0.04055  0.0400000  0.0400000 
summation % of load in interval  Step:    59  Iteration:     2  Time:    0.04055  0.0800000 

Number of zero columns: 4
Memory requirement - global matrix: 1571340 solver (totally): 1571340
P1127_VELOCITIES #001000  Step:    59  Iteration:     2  Time:    0.01638 -0.0016876 
P2243_VELOCITIES #001000  Step:    59  Iteration:     2  Time:    0.01638 -0.0018896 
P3387_VELOCITIES #001000  Step:    59  Iteration:     2  Time:    0.01638 -0.0016876 
% of load in interval  Step:    59  Iteration:     2  Time:    0.01638  0.0400000  0.0400000 
summation % of load in interval  Step:    59  Iteration:     2  Time:    0.01638  0.0800000

I want to extract the P1127_VELOCITIES values using this code:

import re

P1127_positive = re.compile(r'P1127_VELOCITIES #001000  Step:    (\d+)  Iteration:     (\d+)  Time:    (\d+\.\d+)  (\d*\.\d+|-\d*\.\d+)')

P1127_negative = re.compile(r'P1127_VELOCITIES #001000  Step:    (\d+)  Iteration:     (\d+)  Time:    (\d+\.\d+) (\d*\.\d+|-\d*\.\d+)')


def Extract_Data(filepath, expression_positive, expression_negative, data):

    velocity_list = []
    time_list = []
    #negative_data = []

    with open(filepath) as file:
        for line in file:
            data.extend(expression_positive.findall(line))

    with open(filepath) as file:
        for line in file:
            data.extend(expression_negative.findall(line))
    print(data[0])
    print(data[1])
    for data_tuple in data:
        step, iteration, time, velocity = data_tuple
        velocity_list.append(float(velocity))
        time_list.append(float(time))



    return velocity_list, time_list

However, I want to extract all the float values at the right end in one go, not the positive and negative values separately. As you can see in the text file, the positive values are preceded by 2 spaces (i.e. Time: 0.04055[space][space]0.0015347), while the negative values are preceded by only 1 space (i.e. Time: 0.01638[space]-0.0016876).

Is there a way to extract both with a single re.compile expression (like what I have above, but matching both)? What expression would you recommend (e.g. ([-+]?\d\.\d+))?

Cheers!

Solution

The regexes in your code seem like overkill for the file you've shown. There's no reason for them to be so rigid that a one-character difference (here, a single space) requires a new pattern. Nothing in the file seems to call for being that specific about the exact number of spaces in a line; matching runs of whitespace with \s+ or \s* is enough.
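To illustrate the point with the original pattern: replacing each literal run of spaces with \s+ and making the minus sign optional should be enough for one compiled expression to cover both cases. This is only a sketch; the name p1127 is made up for illustration, and the two sample lines are copied from the question.

```python
import re

# One pattern for both signs:
#   \s+ absorbs any run of spaces (one or two before the last value),
#   -?  makes the leading minus sign optional.
p1127 = re.compile(
    r"P1127_VELOCITIES\s+#001000\s+Step:\s+(\d+)\s+Iteration:\s+(\d+)"
    r"\s+Time:\s+(\d+\.\d+)\s+(-?\d*\.\d+)"
)

# Sample lines taken from the question's text file.
lines = [
    "P1127_VELOCITIES #001000  Step:    59  Iteration:     2  Time:    0.04055  0.0015347 ",
    "P1127_VELOCITIES #001000  Step:    59  Iteration:     2  Time:    0.01638 -0.0016876 ",
]

rows = []
for line in lines:
    m = p1127.search(line)
    if m:
        # groups() yields (step, iteration, time, velocity) as strings
        rows.append(m.groups())

print(rows)
```

Both lines match, so the existing loop that unpacks step, iteration, time and velocity would work unchanged with this single expression.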

The snippet below does the job cleanly on the file you've shared (I'm using append rather than extend so that each row's (time, value) pair is kept together as a tuple). It's simple to add more requirements to match lines more specifically as needed (if you wish to specify a step or iteration, for example). You can also build the regex pattern dynamically if you'd like to drop this into a function and use it to filter by different velocity labels (P2243_VELOCITIES, for example).

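import re

# Capture the two whitespace-separated tokens after "Time:";
# \s+ accepts one or more spaces, so both signs match.
pattern = r"P1127_VELOCITIES.+?Time:\s*(\S+)\s+(\S+)\s*$"
data = []

with open("file.txt") as f:
    for line in f:
        # re.match anchors at the start of the line
        m = re.match(pattern, line)

        if m:
            # Store each row as a (time, value) tuple of floats
            data.append(tuple(map(float, m.groups())))

print(data)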

Output:

[(0.04055, 0.0015347), (0.01638, -0.0016876)]
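If you do want to wrap this in a function and build the pattern dynamically, one possible sketch is below. The function name extract_series and its parameters are hypothetical, and the sample lines are copied from the question; re.escape is used so that a label containing regex metacharacters is matched literally.

```python
import re

def extract_series(lines, label):
    """Return (time, value) float pairs for lines that start with `label`.

    `extract_series` and `label` are illustrative names, not part of
    the original answer.
    """
    # Build the pattern from the label; re.escape keeps it literal.
    pattern = re.compile(re.escape(label) + r".+?Time:\s*(\S+)\s+(\S+)\s*$")
    pairs = []
    for line in lines:
        m = pattern.match(line)
        if m:
            pairs.append(tuple(map(float, m.groups())))
    return pairs

# Sample lines taken from the question's text file.
sample = [
    "P1127_VELOCITIES #001000  Step:    59  Iteration:     2  Time:    0.04055  0.0015347 ",
    "P2243_VELOCITIES #001000  Step:    59  Iteration:     2  Time:    0.04055  0.0017193 ",
    "P1127_VELOCITIES #001000  Step:    59  Iteration:     2  Time:    0.01638 -0.0016876 ",
]

print(extract_series(sample, "P2243_VELOCITIES"))
```

Because the function only iterates over lines, you can also pass it an open file object directly, e.g. inside a with open("file.txt") as f: block.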