Search code examples
pythonpython-3.xpandastext-extraction

Pandas: Extract numerical values after keywords substrings of inconsistent delimiters & irregular orders and transform keywords & numbers to columns


I have an irregularly formatted database with the dataframe looks as follows:

Area Dimensions
foo Length: 2m; Width: 3m; Height: 4m; Slope- 3
bar Width: 6m; Length: 4m; Height: 3m; Slope: 6
baz Height: 4m; Slope: 4; Volume = 24m3
qux Vol: 42m3

The delimiters are consistently semi-colons, but the colons may be replaced by some other symbol such as a dash or an equal sign. The order of the values are also inconsistent, so str.split was not effective. I would like to extract as much information as I can from the Dimensions column and leave 0/Null values for values not specified.

I would like it to look as follows:

Area Length Width Height Slope Volume
foo 2 3 4 3 NULL
bar 4 6 3 6 NULL
baz NULL NULL 4 4 24
qux NULL NULL NULL NULL 42

Solution

  • New Version:

    Main enhancement of new version is to substantially simplify the creation of keyword-value table. Text extraction regex is also simplified to without specifying a set of predefined keywords.

    Use str.findall() + map, as follows:

    1. Extract the Dimensions keywords and values by str.findall() into a list of key-value pairs tuples
    2. map these key-value pairs tuples to dict and create a dataframe
    3. Join Area column with the newly created keyword-value dataframe by .join()
    # replace 'Vol' to 'Volume` 
    # extract `Dimensions` keywords and numeric values into tuples of paired values
    dim_extract = (df['Dimensions'].str.replace(r'Vol\b', 'Volume', regex=True)
                                   .str.findall(r'(\w+)\W+(\d+(?:\.\d+)?)\w*(?:;|$)')
                  )
    
    # map key-value pairs to `dict` and create a dataframe
    keyword_df = pd.DataFrame(map(dict, dim_extract))
    
    # Optionally convert the extracted dimension values from string to float or integer format 
    #keyword_df = keyword_df.apply(pd.to_numeric)                    # convert to float
    #keyword_df = keyword_df.apply(pd.to_numeric).astype('Int64')    # convert to integer
    
    # join `Area` column with newly created keyword dataframe
    df_out = df[['Area']].join(keyword_df)
    

    Result:

    print(df_out)
    
      Area Length Width Height Slope Volume
    0  foo      2     3      4     3    NaN
    1  bar      4     6      3     6    NaN
    2  baz    NaN   NaN      4     4     24
    3  qux    NaN   NaN    NaN   NaN     42
    

    Old Version:

    Use str.findall() + .explode() + .pivot(), as follows:

    1. Extract the Dimensions keywords and values by str.findall() into a list of key-value pairs tuples
    2. Transform each element in the list to a row by .explode()
    3. Further separate the paired values of Dimensions keywords and values from a tuple into individual columns
    4. Transform the Dimensions keywords into columns by .pivot()
    # replace 'Vol' to 'Volume` 
    # extract `Dimensions` keywords and numeric values into tuples of paired values
    df['extract'] = (df['Dimensions'].str.replace(r'Vol\b', 'Volume', regex=True)
                                     .str.findall(r'(Length|Width|Height|Slope|Volume)\W+(\d+(?:\.\d+)?)\w*(?:;|$)')
                    )
    
    # Transform each element in the list to a row
    df2 = df.explode('extract')
    
    # Separate the `Dimensions` keywords and values from a tuple into individual columns 
    df2['col_name'], df2['col_val'] = zip(*df2['extract'])
    
    # Optionally convert the extracted dimension values from string to float or integer format 
    #df2['col_val'] = df2['col_val'].astype(float)
    #df2['col_val'] = df2['col_val'].astype(int)
    
    # Transform the `Dimensions` keywords into columns 
    df_out = df2.pivot(index='Area', columns='col_name', values='col_val').rename_axis(columns=None).reset_index()
    

    Result:

    print(df_out)
    
      Area Height Length Slope Volume Width
    0  bar      3      4     6    NaN     6
    1  baz      4    NaN     4     24   NaN
    2  foo      4      2     3    NaN     3
    3  qux    NaN    NaN   NaN     42   NaN