python pandas string dataframe expand

Expand a dataframe row into multiple rows based on string conditions


I have some raw data similar to the dataframe below:


import pandas as pd

df = pd.DataFrame([{'var1': '220-224 (Even) roadname1', 'var2': 'location 1', 'var3': 'area 1'},
                   {'var1': 'site of 5 to 9 (odd) roadname2', 'var2': 'location 2', 'var3': 'area 2'},
                   {'var1': '16, 19 roadname3', 'var2': 'location 3', 'var3': 'area 3'}])
df

                             var1        var2    var3
0        220-224 (Even) roadname1  location 1  area 1
1  site of 5 to 9 (odd) roadname2  location 2  area 2
2                16, 19 roadname3  location 3  area 3

I would like to write a function that will split the var1 strings so that each number indicated becomes a separate row in the dataframe, with an output such as:


df = pd.DataFrame([{'var1': '220 roadname1', 'var2': 'location 1', 'var3': 'area 1'},
                   {'var1': '222 roadname1', 'var2': 'location 1', 'var3': 'area 1'},
                   {'var1': '224 roadname1', 'var2': 'location 1', 'var3': 'area 1'},
                   {'var1': '5 roadname2', 'var2': 'location 2', 'var3': 'area 2'},
                   {'var1': '7 roadname2', 'var2': 'location 2', 'var3': 'area 2'},
                   {'var1': '9 roadname2', 'var2': 'location 2', 'var3': 'area 2'},
                   {'var1': '16 roadname3', 'var2': 'location 3', 'var3': 'area 3'},
                   {'var1': '19 roadname3', 'var2': 'location 3', 'var3': 'area 3'}])
df

            var1        var2    var3
0  220 roadname1  location 1  area 1
1  222 roadname1  location 1  area 1
2  224 roadname1  location 1  area 1
3    5 roadname2  location 2  area 2
4    7 roadname2  location 2  area 2
5    9 roadname2  location 2  area 2
6   16 roadname3  location 3  area 3
7   19 roadname3  location 3  area 3

The strings vary in capitalization and in how the number ranges are written, and I am not sure whether there is an efficient way to do this that copes with that variation.


Solution

  • Use a custom function to split the ranges (below is an example using regular expressions), then explode:

    import re

    def parse_range(s):
        # handle the "x-y" / "x to y" case with an optional (odd)/(even) hint
        pat1 = r'^\D*(\d+)(?:-|\s+to\s+)(\d+)(?:\s*\((even|odd)\))?\s*(.*)$'
        # handle the "a, b, c" case
        pat2 = r'^\D*([\d ,]+)\s*(.*)$'
        # match case-insensitively instead of lowercasing s, so the
        # capitalization of the road name is preserved in the output
        m1 = re.search(pat1, s, flags=re.IGNORECASE)
        if m1:
            end = m1.group(4)
            if m1.group(3): # if odd/even, only generate every other value
                # NB. there is no check that odd/even actually matches the
                # parity of the numbers, but it is easy to add if needed
                return [f'{i} {end}' for i in
                        range(int(m1.group(1)), int(m1.group(2))+1, 2)]
            else: # generate all numbers in range
                return [f'{i} {end}' for i in
                        range(int(m1.group(1)), int(m1.group(2))+1)]
        m2 = re.search(pat2, s)
        if m2: # second case, split the individual numbers
            end = m2.group(2)
            return [f'{i} {end}' for i in re.findall(r'\d+', m2.group(1))]
        return s # fallback, return the string unchanged

    out = (df.assign(var1=df['var1'].map(parse_range))
             .explode('var1')
          )
    

    Output:

                var1        var2    var3
    0  220 roadname1  location 1  area 1
    0  222 roadname1  location 1  area 1
    0  224 roadname1  location 1  area 1
    1    5 roadname2  location 2  area 2
    1    7 roadname2  location 2  area 2
    1    9 roadname2  location 2  area 2
    2   16 roadname3  location 3  area 3
    2   19 roadname3  location 3  area 3
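
The index in the output above reads 0, 0, 0, 1, ... because explode repeats the label of the row each element came from. If a fresh 0..n-1 index is wanted, as in the expected output of the question, one option is to chain reset_index(drop=True); passing ignore_index=True to explode should give the same result on pandas 1.1 or later. A minimal sketch, where the small `demo` frame is an assumption standing in for the frame after `.map(parse_range)`:

```python
import pandas as pd

# Stand-in for the frame after .map(parse_range): var1 already holds lists.
demo = pd.DataFrame({
    'var1': [['220 roadname1', '222 roadname1'],
             ['16 roadname3', '19 roadname3']],
    'var2': ['location 1', 'location 3'],
})

# explode() repeats the source row label for every list element ...
exploded = demo.explode('var1')

# ... so drop the old labels to get a fresh 0..n-1 index.
renumbered = exploded.reset_index(drop=True)
```

Keeping the repeated labels can still be useful when the expanded rows need to be traced back to their source row, so whether to renumber depends on what happens next.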
    

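
The comment in parse_range notes that the (odd)/(even) hint is not checked against the numbers themselves. One possible way to add that check is sketched below, under the assumption that the hint should win over the range endpoints: walk the whole range and keep only the numbers of the stated parity. `expand_range` and `RANGE_RE` are hypothetical names used for illustration, and the sketch covers only the range case, not the comma-separated one.

```python
import re

# Same range pattern as pat1 above, compiled case-insensitively.
RANGE_RE = re.compile(
    r'^\D*(\d+)(?:-|\s+to\s+)(\d+)(?:\s*\((even|odd)\))?\s*(.*)$',
    re.IGNORECASE,
)

def expand_range(s):
    """Hypothetical helper: expand 'x-y' / 'x to y', honouring (odd)/(even)."""
    m = RANGE_RE.search(s)
    if not m:
        return [s]  # not a range: return the string as a one-item list
    start, stop = int(m.group(1)), int(m.group(2))
    parity, rest = m.group(3), m.group(4)
    numbers = range(start, stop + 1)
    if parity:
        # Keep only numbers whose parity matches the hint, so a range
        # whose first number has the "wrong" parity is still handled.
        wanted = 0 if parity.lower() == 'even' else 1
        numbers = [n for n in numbers if n % 2 == wanted]
    return [f'{n} {rest}' for n in numbers]
```

For example, `expand_range('4 to 9 (odd) Main St')` yields the entries for 5, 7 and 9, whereas stepping by 2 from the first number would have produced 4, 6 and 8.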