I am writing a code to capture different variable values in different regex groups to create a data frame. I split each data observation into different text strings. We have four types of text strings:
Type 1:
10 129000 TEMPORARY RAILING (TYPE K) LF 960 27.50 26,400.00
Type 2:
18 (F) 510502 MINOR CONCRETE (MINOR STRUCTURE) LUMP SUM 4 2,740.00 10,960.00
Type 3:
21 839312 DOUBLE THRIE BEAM BARRIER (STEEL POST) LF 21,600 33.35 720,360.00
Text 4:
15 156585 18" CURED-IN-PLACE PIPE LINER EA 1 300.00 300.00
I am writing a code to capture different variable values in different regex groups to create a data frame. The data frame would eventually look like this:
v1 | v2 | v3 | v4 | v5 | v6 |
---|---|---|---|---|---|
10 | 129000 | TEMPORARY RAILING (TYPE K) | LF | 960 | 27.50 |
18 | 510502 | MINOR CONCRETE (MINOR STRUCTURE) | LUMP SUM | 4 | 2,740.00 |
21 | 839312 | DOUBLE THRIE BEAM BARRIER (STEEL POST) | LF | 21,600 | 33.35 |
I am working on creating one regex pattern that can do that for us. Right now, I have the pattern.^(\s{6}|\s{7})(\d+)\s+(?!(F))(\d+)\s+([A-Z.]+\s\w+[^\n\d]*)((?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S))\s+((?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S))\s+((?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S))
. It captures the following:
Text 1:
group 2: 10
; group 4: 129000
; group 5: TEMPORARY RAILING (TYPE K) LF
; group 7: 960
; group 9: 27.50
; group 11: 26,400.00
Text 2: Does not capture anything because of (F). I think we can work on removing this before running the regex search.
Text 3:
group 2: 21
; group 4: 839312
; group 5: DOUBLE THRIE BEAM BARRIER (STEEL POST) LF
; group 7: 21,600
; group 9: 33.35
; group 11: 720,360.00
Text 4:
It does not capture anything due to 18"
([A-Z.]+\s\w+[^\n\d]*)
), is there a way to tell regex that we want it to stop capturing beyond the first instance of 2 whitespaces? That should solve that problem?(F)
and (S)
after group 2 ((\d+)
) is captured? That should solve the problem?Thank you so much for your help in advance!
Try:
import re
data = '''\
10 129000 TEMPORARY RAILING (TYPE K) LF 960 27.50 26,400.00
11 129100 TEMPORARY CRASH CUSHION MODULE EA 56 127.00 7,112.00
12 150662 REMOVE METAL BEAM GUARD RAILING LF 1,390 3.00 4,170.00
13 153210 REMOVE CONCRETE CY 2 660.00 1,320.00
14 015310 REMOVE BRIDGE APPROACH GUARD RAILING LF 200 6.30 1,260.00
15 156585 18" CURED-IN-PLACE PIPE LINER EA 1 300.00 300.00
16 160101 CLEARING AND GRUBBING LS LUMP SUM 2,500.00 2,500.00
17 190110 LEAD COMPLIANCE PLAN LS LUMP SUM 850.00 850.00
18 (F) 510502 MINOR CONCRETE (MINOR STRUCTURE) CY 4 2,740.00 10,960.00
19 820118 GUARD RAILING DELINEATOR EA 12 15.00 180.00
20 839303 SINGLE THRIE BEAM BARRIER (STEEL POST) LF 3,630 22.00 79,860.00
21 839312 DOUBLE THRIE BEAM BARRIER (STEEL POST) LF 21,600 33.35 720,360.00
22 839542 TRANSITION RAILING (TYPE DTB) EA 4 2,600.00 10,400.00
'''
data = data.replace('(F)', ' ') # 3 spaces!
df = pd.DataFrame([re.split(r'\s{2,}', row) for row in data.splitlines()]).add_prefix('v').loc[:,'v1':]
print(df.to_markdown(index=False))
Prints:
v1 | v2 | v3 | v4 | v5 | v6 | v7 |
---|---|---|---|---|---|---|
10 | 129000 | TEMPORARY RAILING (TYPE K) | LF | 960 | 27.50 | 26,400.00 |
11 | 129100 | TEMPORARY CRASH CUSHION MODULE | EA | 56 | 127.00 | 7,112.00 |
12 | 150662 | REMOVE METAL BEAM GUARD RAILING | LF | 1,390 | 3.00 | 4,170.00 |
13 | 153210 | REMOVE CONCRETE | CY | 2 | 660.00 | 1,320.00 |
14 | 015310 | REMOVE BRIDGE APPROACH GUARD RAILING | LF | 200 | 6.30 | 1,260.00 |
15 | 156585 | 18" CURED-IN-PLACE PIPE LINER | EA | 1 | 300.00 | 300.00 |
16 | 160101 | CLEARING AND GRUBBING | LS | LUMP SUM | 2,500.00 | 2,500.00 |
17 | 190110 | LEAD COMPLIANCE PLAN | LS | LUMP SUM | 850.00 | 850.00 |
18 | 510502 | MINOR CONCRETE (MINOR STRUCTURE) | CY | 4 | 2,740.00 | 10,960.00 |
19 | 820118 | GUARD RAILING DELINEATOR | EA | 12 | 15.00 | 180.00 |
20 | 839303 | SINGLE THRIE BEAM BARRIER (STEEL POST) | LF | 3,630 | 22.00 | 79,860.00 |
21 | 839312 | DOUBLE THRIE BEAM BARRIER (STEEL POST) | LF | 21,600 | 33.35 | 720,360.00 |
22 | 839542 | TRANSITION RAILING (TYPE DTB) | EA | 4 | 2,600.00 | 10,400.00 |
Or you can use pd.read_fwf
:
from io import StringIO
data = data.replace('(F)', ' ') # 3 spaces!
df = pd.read_fwf(StringIO(data), header=None)
print(df.to_markdown(index=False))