python python-3.x pandas text-extraction

Pandas: Extract numerical values after keywords substrings of inconsistent delimiters & irregular orders and transform keywords & numbers to columns

I have an irregularly formatted database with the dataframe looks as follows:

Area	Dimensions
foo	Length: 2m; Width: 3m; Height: 4m; Slope- 3
bar	Width: 6m; Length: 4m; Height: 3m; Slope: 6
baz	Height: 4m; Slope: 4; Volume = 24m3
qux	Vol: 42m3

The delimiters are consistently semi-colons, but the colons may be replaced by some other symbol such as a dash or an equal sign. The order of the values are also inconsistent, so str.split was not effective. I would like to extract as much information as I can from the Dimensions column and leave 0/Null values for values not specified.

I would like it to look as follows:

Area	Length	Width	Height	Slope	Volume
foo	2	3	4	3	NULL
bar	4	6	3	6	NULL
baz	NULL	NULL	4	4	24
qux	NULL	NULL	NULL	NULL	42

Solution

New Version:

Main enhancement of new version is to substantially simplify the creation of keyword-value table. Text extraction regex is also simplified to without specifying a set of predefined keywords.

Use `str.findall()` + `map`, as follows:

Extract the Dimensions keywords and values by str.findall() into a list of key-value pairs tuples
map these key-value pairs tuples to dict and create a dataframe
Join Area column with the newly created keyword-value dataframe by .join()

# replace 'Vol' to 'Volume` 
# extract `Dimensions` keywords and numeric values into tuples of paired values
dim_extract = (df['Dimensions'].str.replace(r'Vol\b', 'Volume', regex=True)
                               .str.findall(r'(\w+)\W+(\d+(?:\.\d+)?)\w*(?:;|$)')
              )

# map key-value pairs to `dict` and create a dataframe
keyword_df = pd.DataFrame(map(dict, dim_extract))

# Optionally convert the extracted dimension values from string to float or integer format 
#keyword_df = keyword_df.apply(pd.to_numeric)                    # convert to float
#keyword_df = keyword_df.apply(pd.to_numeric).astype('Int64')    # convert to integer

# join `Area` column with newly created keyword dataframe
df_out = df[['Area']].join(keyword_df)

Result:

print(df_out)

  Area Length Width Height Slope Volume
0  foo      2     3      4     3    NaN
1  bar      4     6      3     6    NaN
2  baz    NaN   NaN      4     4     24
3  qux    NaN   NaN    NaN   NaN     42

Old Version:

Use `str.findall()` + `.explode()` + `.pivot()`, as follows:

Extract the Dimensions keywords and values by str.findall() into a list of key-value pairs tuples
Transform each element in the list to a row by .explode()
Further separate the paired values of Dimensions keywords and values from a tuple into individual columns
Transform the Dimensions keywords into columns by .pivot()

# replace 'Vol' to 'Volume` 
# extract `Dimensions` keywords and numeric values into tuples of paired values
df['extract'] = (df['Dimensions'].str.replace(r'Vol\b', 'Volume', regex=True)
                                 .str.findall(r'(Length|Width|Height|Slope|Volume)\W+(\d+(?:\.\d+)?)\w*(?:;|$)')
                )

# Transform each element in the list to a row
df2 = df.explode('extract')

# Separate the `Dimensions` keywords and values from a tuple into individual columns 
df2['col_name'], df2['col_val'] = zip(*df2['extract'])

# Optionally convert the extracted dimension values from string to float or integer format 
#df2['col_val'] = df2['col_val'].astype(float)
#df2['col_val'] = df2['col_val'].astype(int)

# Transform the `Dimensions` keywords into columns 
df_out = df2.pivot(index='Area', columns='col_name', values='col_val').rename_axis(columns=None).reset_index()