I have an irregularly formatted database with the dataframe looks as follows:
Area | Dimensions |
---|---|
foo | Length: 2m; Width: 3m; Height: 4m; Slope- 3 |
bar | Width: 6m; Length: 4m; Height: 3m; Slope: 6 |
baz | Height: 4m; Slope: 4; Volume = 24m3 |
qux | Vol: 42m3 |
The delimiters are consistently semi-colons, but the colons may be replaced by some other symbol such as a dash or an equal sign. The order of the values are also inconsistent, so str.split
was not effective. I would like to extract as much information as I can from the Dimensions
column and leave 0/Null values for values not specified.
I would like it to look as follows:
Area | Length | Width | Height | Slope | Volume |
---|---|---|---|---|---|
foo | 2 | 3 | 4 | 3 | NULL |
bar | 4 | 6 | 3 | 6 | NULL |
baz | NULL | NULL | 4 | 4 | 24 |
qux | NULL | NULL | NULL | NULL | 42 |
Main enhancement of new version is to substantially simplify the creation of keyword-value table. Text extraction regex is also simplified to without specifying a set of predefined keywords.
str.findall()
+ map
, as follows:Dimensions
keywords and values by str.findall()
into a list of key-value pairs tuplesmap
these key-value pairs tuples to dict
and create a dataframeArea
column with the newly created keyword-value dataframe by .join()
# replace 'Vol' to 'Volume`
# extract `Dimensions` keywords and numeric values into tuples of paired values
dim_extract = (df['Dimensions'].str.replace(r'Vol\b', 'Volume', regex=True)
.str.findall(r'(\w+)\W+(\d+(?:\.\d+)?)\w*(?:;|$)')
)
# map key-value pairs to `dict` and create a dataframe
keyword_df = pd.DataFrame(map(dict, dim_extract))
# Optionally convert the extracted dimension values from string to float or integer format
#keyword_df = keyword_df.apply(pd.to_numeric) # convert to float
#keyword_df = keyword_df.apply(pd.to_numeric).astype('Int64') # convert to integer
# join `Area` column with newly created keyword dataframe
df_out = df[['Area']].join(keyword_df)
Result:
print(df_out)
Area Length Width Height Slope Volume
0 foo 2 3 4 3 NaN
1 bar 4 6 3 6 NaN
2 baz NaN NaN 4 4 24
3 qux NaN NaN NaN NaN 42
str.findall()
+ .explode()
+ .pivot()
, as follows:Dimensions
keywords and values by str.findall()
into a list of key-value pairs tuples.explode()
Dimensions
keywords and values from a tuple into individual columnsDimensions
keywords into columns by .pivot()
# replace 'Vol' to 'Volume`
# extract `Dimensions` keywords and numeric values into tuples of paired values
df['extract'] = (df['Dimensions'].str.replace(r'Vol\b', 'Volume', regex=True)
.str.findall(r'(Length|Width|Height|Slope|Volume)\W+(\d+(?:\.\d+)?)\w*(?:;|$)')
)
# Transform each element in the list to a row
df2 = df.explode('extract')
# Separate the `Dimensions` keywords and values from a tuple into individual columns
df2['col_name'], df2['col_val'] = zip(*df2['extract'])
# Optionally convert the extracted dimension values from string to float or integer format
#df2['col_val'] = df2['col_val'].astype(float)
#df2['col_val'] = df2['col_val'].astype(int)
# Transform the `Dimensions` keywords into columns
df_out = df2.pivot(index='Area', columns='col_name', values='col_val').rename_axis(columns=None).reset_index()
Result:
print(df_out)
Area Height Length Slope Volume Width
0 bar 3 4 6 NaN 6
1 baz 4 NaN 4 24 NaN
2 foo 4 2 3 NaN 3
3 qux NaN NaN NaN 42 NaN