I'm attempting to clean up a set of old files that contain sensor data measurements. Many of the files don't have headers, and the format (column ordering, etc.) is inconsistent. I'm thinking the best that I can do in these cases is to match statistical profiles of the columns to data from files that do have good headers. This seems like it should be simple using something like Pandas Profiling, but I haven't found any examples. I'm looking for something that would calculate a score for the similarity between each column in the header-less file and each "known" column for which I already have headers.
Example Data with Headers:
Large Value Column | Small Value Column |
---|---|
100 | 5 |
102 | 12 |
110 | 8 |
98 | 10 |
Example Data with only column numbers:
0 | 1 |
---|---|
6 | 99 |
9 | 105 |
11 | 101 |
14 | 100 |
For the above example, I would like to automatically determine that column 1 should be added to "Large Value Column" and Column 0 to "Small Value Column".
As stated by @Arthur Bricq, the best solution depends a lot on the shape of your data. Unfortunately, here's no "one solution fits all approach". Having said that, the following code was able to correctly rename the columns from your example, using the fuzzywuzzy
package:
Note: You need to pip install
fuzzywuzzy
before running the code below. To do so, run the following command:
pip install fuzzywuzzy
from typing import Tuple, List
from fuzzywuzzy import fuzz
import pandas as pd
def find_similar(
*dfs: pd.DataFrame, known_df: pd.DataFrame
) -> Tuple[List[pd.DataFrame], pd.DataFrame]:
"""
Name columns in a list of dataframes based on the columns from known dataframe.
Each of the dataframes in the list will be renamed based on the columns from the
known dataframe. To avoid cases where two different unnamed columns match the
same column from the known dataframe, the function matches each of the known columns
to the unnamed column that contains the highest similarity score and then removes
the matched columns from the list of possible columns. For example:
>>> x = pd.DataFrame(
... {'Large Value Column' :[100, 102, 110, 98,]
... 'Small Value Column': [5, 12, 8, 10,]}
... )
>>> y = pd.DataFrame(
... {0: [6, 9, 11, 14],
... 1: [20, 25, 55, 65]}
... )
>>> pd.DataFrame(
... [
... [
... fuzz.ratio(
... list(map(str, y[col].values)), list(map(str, x[_col].values))
... ) for _col in x.columns
... ] for col in y.columns
... ],
... index=y.columns,
... columns=x.columns,
... )
Large Value Column Small Value Column
0 73 77
1 71 74
In the above example, "Small Value Column" has the highest similarity score for
both columns "0" and "1". However, since "0" has the highest similarity score
then, "1" will be renamed to "Large Value Column".
Parameters
----------
dfs : pd.DataFrame
List of dataframes to rename.
known_df : pd.DataFrame
Pandas dataframe with the correct column names.
Returns
-------
Tuple[List[pd.DataFrame], pd.DataFrame]
List of dataframes with renamed columns, and the known dataframe.
"""
# Store the results in a list
_dfs = []
# Loop over the dataframes
for df in dfs:
# Create a dataframe that contains the similarities between the columns
# in the known dataframe and the columns in the current dataframe.
# The columns represent the known dataframe columns, and the index
# represents the current dataframe columns.
similarity_df = pd.DataFrame(
[
[
fuzz.ratio(
list(map(str, df[col].values)),
list(map(str, known_df[_col].values)),
)
for _col in known_df.columns
]
for col in df.columns
],
index=df.columns,
columns=known_df.columns,
)
# Dictionary to map old and new column names
rename_dict = {}
# Keep track of the columns that have already been matched
# to avoid naming two columns using the same name.
used_names = []
for col in similarity_df.max().sort_values(ascending=False).keys():
old_name = similarity_df[~similarity_df.index.isin(used_names)][
col
].idxmax()
rename_dict[old_name] = col
used_names.append(old_name)
# Rename the columns in the current dataframe and append it to the list
_dfs.append(df.rename(columns=rename_dict, errors="ignore"))
return _dfs, known_df
import pandas as pd
x = pd.DataFrame(
{
"Large Value Column": [100, 102, 110, 98],
"Small Value Column": [5, 12, 8, 10],
}
)
y = pd.DataFrame({0: [6, 9, 11, 14], 1: [20, 40, 55, 65]})
z = pd.DataFrame({0: [95, 80, 72, 100], 1: [0, 20, 14, 10]})
pd.concat(find_similar(y, z, known_df=x)[0])
# Returns:
#
# Small Value Column Large Value Column
# 0 6 20
# 1 9 40
# 2 11 55
# 3 14 65
# 0 0 95
# 1 20 80
# 2 14 72
# 3 10 100