Replacement with regex or anything else in pandas dataframe


I want to manage data that I receive from multiple parties and convert it into structured data so that it is consistent across our system.

For example, I receive data like this:

  1. nominal diameter 1-13 x 0.5 mm
  2. nominal diameter 10 mm
  3. for external diameter 15mm
  4. head dm 9.00 mm
  5. diameter 208/20 mm height 218 mm

The goal is to get this output, in the same order:

  1. M1-13x0.5
  2. M10
  3. M15
  4. M9
  5. M208/20 H218

I thought I would do it with multiple regexes like the ones below and then replace the matches:

# Case "nominal diameter A x B mm": capture A and B
df['diameter'] = df['New_size'].str.findall(r'^nominal diameter\s(\S+)\sx\s(\S+)\smm$')
df['diam2'] = 'Nom.M'+df['diameter'].str[0].str[0]+'x'+df['diameter'].str[0].str[1]

# Case "nominal diameter A mm": capture A only
df['diameter'] = df['New_size'].str.findall(r'^nominal diameter\s(\S+)\smm$')
df['diam2'] = 'Nom.M'+df['diameter'].str[0]

But this only helps when searching for the diameter. In example 5 I would need to run it again and somehow join the results to also add the height…

Is there a way to loop through the data and replace, like Excel's find-and-replace but with regexes? Or is my plan completely stupid and there are much better ways to do this?

I thought of splitting everything into pieces, replacing word after word, and then joining only the info I think is necessary. But it really seems I am making it too complex. Thanks!


Solution

  • With this type of editing, you can't get everything right on the first pass, especially if the df is big and has diverse text. Go in stages.

    Option 1: Replace patterns rather than target substrings

    Replace values in the df with a regex mapping. First, replace every run of letters (except x) that is followed by a space and a digit with H. Then replace all non-digits from the start of the string up to the first number with M. Finally, remove mm and any .00.

    # Raw strings avoid invalid-escape warnings; \. matches a literal dot, not any character
    df['text_edited'] = df['text'].replace(regex={r'[a-wy-z]+\s(?=\d)': 'H', r'^\D+(?=\d)': 'M', 'mm': '', r'\.00': ''})
    
         
    
                       text                text_edited
    0    nominal diameter 1-13 x 0.5 mm    M1-13 x 0.5 
    1            nominal diameter 10 mm            M10 
    2        for external diameter 15mm             M15
    3                   head dm 9.00 mm             M9 
    4  diameter 208/20 mm height 218 mm  M208/20  H218
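A staged approach like the one above can also be written as an explicit loop over (pattern, replacement) pairs, which makes the order visible and easy to extend. This is only a minimal sketch in plain Python: the names `STAGES`, `apply_stages` and `has_residue` are hypothetical, and the unit and whitespace rules are slightly stricter than in the one-liner above (an assumption about what the cleaned text should look like).

```python
import re

# Hypothetical ordered stages; each stage sees the output of the previous one.
STAGES = [
    (r"[a-wy-z]+\s(?=\d)", "H"),  # word (no x) directly before a number -> H
    (r"^\D+(?=\d)", "M"),         # everything before the first digit -> M
    (r"\s*mm\b", ""),             # drop the unit and the space before it
    (r"\.00\b", ""),              # drop a literal ".00"
    (r"\s+", " "),                # collapse repeated whitespace
]

def apply_stages(text, stages=STAGES):
    """Apply every (pattern, replacement) pair in order."""
    for pattern, repl in stages:
        text = re.sub(pattern, repl, text)
    return text.strip()

def has_residue(text):
    """True if any letter other than M, H or x is left over."""
    return bool(re.search(r"[^\W\dMHx_]", text))

samples = [
    "nominal diameter 1-13 x 0.5 mm",
    "nominal diameter 10 mm",
    "for external diameter 15mm",
    "head dm 9.00 mm",
    "diameter 208/20 mm height 218 mm",
]
cleaned = [apply_stages(s) for s in samples]
```

On a DataFrame the same function could be applied with `df['text'].map(apply_stages)`, and `has_residue` can then flag the rows that need another stage.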
    

    Option 2: Replace target substrings

    I assume your main aim is to replace diameter and height with initials. Do that first, and if any unwanted residue significantly changes the meaning of the attribute, edit the new column further.

    # Swap the keywords for initials, blank out every other letter, drop a literal ".00"
    df['text_edited'] = df['text'].replace(regex={r'diameter\s|dm\s': 'M', r'height\s': 'H', r'[^MHx0-9\W]': ' ', r'\.00': ''})

    # Trim, then remove a stray leading x (e.g. from "external") and the spaces after it
    df['text_edited'] = df['text_edited'].str.strip().str.replace(r'^x\s*', '', regex=True)
    
    
    
                        text                 text_edited
    0    nominal diameter 1-13 x 0.5 mm      M1-13 x 0.5
    1            nominal diameter 10 mm              M10
    2        for external diameter 15mm              M15
    3                   head dm 9.00 mm               M9
    4  diameter 208/20 mm height 218 mm  M208/20    H218
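If the compact form from the question is needed (for example `M1-13x0.5` with no spaces), a different technique is to extract the numbers with capture groups and build the string from them, instead of replacing text in place. The following is a rough sketch under assumptions: the keywords are always `diameter`, `dm` or `height`, and the names `DIAMETER`, `HEIGHT`, `trim` and `to_code` are made up for illustration.

```python
import re

# Assumed keywords: "diameter" or "dm" for M, "height" for H.
DIAMETER = re.compile(
    r"(?:diameter|dm)\s*(?P<d>[\d./-]+)(?:\s*x\s*(?P<pitch>[\d.]+))?"
)
HEIGHT = re.compile(r"height\s*(?P<h>[\d.]+)")

def trim(number):
    """Drop a trailing '.0', '.00', ... so that '9.00' becomes '9'."""
    return re.sub(r"\.0+$", "", number)

def to_code(text):
    """Build 'M<diameter>[x<pitch>] [H<height>]' from a free-text description."""
    parts = []
    m = DIAMETER.search(text)
    if m:
        code = "M" + trim(m.group("d"))
        if m.group("pitch"):
            code += "x" + trim(m.group("pitch"))
        parts.append(code)
    h = HEIGHT.search(text)
    if h:
        parts.append("H" + trim(h.group("h")))
    # Returns an empty string when neither keyword is found
    return " ".join(parts)

samples = [
    "nominal diameter 1-13 x 0.5 mm",
    "nominal diameter 10 mm",
    "for external diameter 15mm",
    "head dm 9.00 mm",
    "diameter 208/20 mm height 218 mm",
]
codes = [to_code(s) for s in samples]
```

With pandas this could be used as `df['text'].map(to_code)`. Rows that come back empty did not match either keyword and are the ones to inspect before adding more patterns.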