Search code examples
pythonregexpython-re

Capturing different text type in different groups


I am writing a code to capture different variable values in different regex groups to create a data frame. I split each data observation into different text strings. We have four types of text strings:

Type 1:

10 129000 TEMPORARY RAILING (TYPE K) LF 960 27.50 26,400.00

Type 2:

18 (F) 510502 MINOR CONCRETE (MINOR STRUCTURE) LUMP SUM 4 2,740.00 10,960.00

Type 3: 21 839312 DOUBLE THRIE BEAM BARRIER (STEEL POST) LF 21,600 33.35 720,360.00

Text 4: 15 156585 18" CURED-IN-PLACE PIPE LINER EA 1 300.00 300.00

I am writing a code to capture different variable values in different regex groups to create a data frame. The data frame would eventually look like this:

v1 v2 v3 v4 v5 v6
10 129000 TEMPORARY RAILING (TYPE K) LF 960 27.50
18 510502 MINOR CONCRETE (MINOR STRUCTURE) LUMP SUM 4 2,740.00
21 839312 DOUBLE THRIE BEAM BARRIER (STEEL POST) LF 21,600 33.35

I am working on creating one regex pattern that can do that for us. Right now, I have the pattern.^(\s{6}|\s{7})(\d+)\s+(?!(F))(\d+)\s+([A-Z.]+\s\w+[^\n\d]*)((?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S))\s+((?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S))\s+((?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S)). It captures the following:

Text 1: group 2: 10; group 4: 129000; group 5: TEMPORARY RAILING (TYPE K) LF ; group 7: 960; group 9: 27.50; group 11: 26,400.00

Text 2: Does not capture anything because of (F). I think we can work on removing this before running the regex search.

Text 3: group 2: 21; group 4: 839312; group 5: DOUBLE THRIE BEAM BARRIER (STEEL POST) LF ; group 7: 21,600; group 9: 33.35; group 11: 720,360.00

Text 4: It does not capture anything due to 18"

  • In group 5 (([A-Z.]+\s\w+[^\n\d]*)), is there a way to tell regex that we want it to stop capturing beyond the first instance of 2 whitespaces? That should solve that problem?
  • Is there a way to add negative lookahead for (F) and (S) after group 2 ((\d+)) is captured? That should solve the problem?
  • Reference regex

Thank you so much for your help in advance!


Solution

  • Try:

    import re
    
    data = '''\
          10     129000      TEMPORARY RAILING (TYPE K)                     LF          960               27.50        26,400.00
          11     129100      TEMPORARY CRASH CUSHION MODULE                 EA           56              127.00         7,112.00
          12     150662      REMOVE METAL BEAM GUARD RAILING                LF        1,390                3.00         4,170.00
          13     153210      REMOVE CONCRETE                                CY            2              660.00         1,320.00
          14     015310      REMOVE BRIDGE APPROACH GUARD RAILING           LF          200                6.30         1,260.00
          15     156585      18" CURED-IN-PLACE PIPE LINER                           EA            1              300.00           300.00
          16     160101      CLEARING AND GRUBBING                          LS     LUMP SUM            2,500.00         2,500.00
          17     190110      LEAD COMPLIANCE PLAN                           LS     LUMP SUM              850.00           850.00
          18 (F) 510502      MINOR CONCRETE (MINOR STRUCTURE)               CY            4            2,740.00        10,960.00
          19     820118      GUARD RAILING DELINEATOR                       EA           12               15.00           180.00
          20     839303      SINGLE THRIE BEAM BARRIER (STEEL POST)         LF        3,630               22.00        79,860.00
          21     839312      DOUBLE THRIE BEAM BARRIER (STEEL POST)         LF       21,600               33.35       720,360.00
          22     839542      TRANSITION RAILING (TYPE DTB)                  EA            4            2,600.00        10,400.00
    '''
    
    data = data.replace('(F)', '   ') # 3 spaces!
    
    df = pd.DataFrame([re.split(r'\s{2,}', row) for row in data.splitlines()]).add_prefix('v').loc[:,'v1':]
    print(df.to_markdown(index=False))
    

    Prints:

    v1 v2 v3 v4 v5 v6 v7
    10 129000 TEMPORARY RAILING (TYPE K) LF 960 27.50 26,400.00
    11 129100 TEMPORARY CRASH CUSHION MODULE EA 56 127.00 7,112.00
    12 150662 REMOVE METAL BEAM GUARD RAILING LF 1,390 3.00 4,170.00
    13 153210 REMOVE CONCRETE CY 2 660.00 1,320.00
    14 015310 REMOVE BRIDGE APPROACH GUARD RAILING LF 200 6.30 1,260.00
    15 156585 18" CURED-IN-PLACE PIPE LINER EA 1 300.00 300.00
    16 160101 CLEARING AND GRUBBING LS LUMP SUM 2,500.00 2,500.00
    17 190110 LEAD COMPLIANCE PLAN LS LUMP SUM 850.00 850.00
    18 510502 MINOR CONCRETE (MINOR STRUCTURE) CY 4 2,740.00 10,960.00
    19 820118 GUARD RAILING DELINEATOR EA 12 15.00 180.00
    20 839303 SINGLE THRIE BEAM BARRIER (STEEL POST) LF 3,630 22.00 79,860.00
    21 839312 DOUBLE THRIE BEAM BARRIER (STEEL POST) LF 21,600 33.35 720,360.00
    22 839542 TRANSITION RAILING (TYPE DTB) EA 4 2,600.00 10,400.00

    Or you can use pd.read_fwf:

    from io import StringIO
    
    data = data.replace('(F)', '   ') # 3 spaces!
    
    df = pd.read_fwf(StringIO(data), header=None)
    print(df.to_markdown(index=False))