Search code examples
pythonpandaspandas-profilingdata-qualitydata-profiling

Detecting similar columns across multiple files based on statistical profile


I'm attempting to clean up a set of old files that contain sensor data measurements. Many of the files don't have headers, and the format (column ordering, etc.) is inconsistent. I'm thinking the best that I can do in these cases is to match statistical profiles of the columns to data from files that do have good headers. This seems like it should be simple using something like Pandas Profiling, but I haven't found any examples. I'm looking for something that would calculate a score for the similarity between each column in the header-less file and each "known" column for which I already have headers.

Example Data with Headers:

Large Value Column Small Value Column
100 5
102 12
110 8
98 10

Example Data with only column numbers:

0 1
6 99
9 105
11 101
14 100

For the above example, I would like to automatically determine that column 1 should be added to "Large Value Column" and Column 0 to "Small Value Column".


Solution

  • As stated by @Arthur Bricq, the best solution depends a lot on the shape of your data. Unfortunately, here's no "one solution fits all approach". Having said that, the following code was able to correctly rename the columns from your example, using the fuzzywuzzy package:

    Note: You need to pip install fuzzywuzzy before running the code below. To do so, run the following command:

    pip install fuzzywuzzy

    
    from typing import Tuple, List
    from fuzzywuzzy import fuzz
    import pandas as pd
    
    
    def find_similar(
        *dfs: pd.DataFrame, known_df: pd.DataFrame
    ) -> Tuple[List[pd.DataFrame], pd.DataFrame]:
        """
        Name columns in a list of dataframes based on the columns from known dataframe.
    
        Each of the dataframes in the list will be renamed based on the columns from the
        known dataframe. To avoid cases where two different unnamed columns match the
        same column from the known dataframe, the function matches each of the known columns
        to the unnamed column that contains the highest similarity score and then removes
        the matched columns from the list of possible columns. For example:
    
        >>> x = pd.DataFrame(
        ...     {'Large Value Column' :[100, 102, 110, 98,]
        ...      'Small Value Column': [5, 12, 8, 10,]}
        ... )
        >>> y = pd.DataFrame(
        ...     {0: [6, 9, 11, 14],
        ...      1: [20, 25, 55, 65]}
        ... )
        >>> pd.DataFrame(
        ...     [
        ...         [
        ...             fuzz.ratio(
        ...                 list(map(str, y[col].values)), list(map(str, x[_col].values))
        ...             ) for _col in x.columns
        ...         ] for col in y.columns
        ...     ],
        ...     index=y.columns,
        ...     columns=x.columns,
        ... )
           Large Value Column  Small Value Column
        0                  73                  77
        1                  71                  74
    
        In the above example, "Small Value Column" has the highest similarity score for
        both columns "0" and "1". However, since "0" has the highest similarity score
        then, "1" will be renamed to "Large Value Column".
    
        Parameters
        ----------
        dfs : pd.DataFrame
            List of dataframes to rename.
        known_df : pd.DataFrame
            Pandas dataframe with the correct column names.
    
        Returns
        -------
        Tuple[List[pd.DataFrame], pd.DataFrame]
            List of dataframes with renamed columns, and the known dataframe.
        """
        # Store the results in a list
        _dfs = []
        # Loop over the dataframes
        for df in dfs:
            # Create a dataframe that contains the similarities between the columns
            # in the known dataframe and the columns in the current dataframe.
            # The columns represent the known dataframe columns, and the index
            # represents the current dataframe columns.
            similarity_df = pd.DataFrame(
                [
                    [
                        fuzz.ratio(
                            list(map(str, df[col].values)),
                            list(map(str, known_df[_col].values)),
                        )
                        for _col in known_df.columns
                    ]
                    for col in df.columns
                ],
                index=df.columns,
                columns=known_df.columns,
            )
            # Dictionary to map old and new column names
            rename_dict = {}
    
            # Keep track of the columns that have already been matched
            # to avoid naming two columns using the same name.
            used_names = []
            for col in similarity_df.max().sort_values(ascending=False).keys():
                old_name = similarity_df[~similarity_df.index.isin(used_names)][
                    col
                ].idxmax()
                rename_dict[old_name] = col
                used_names.append(old_name)
            # Rename the columns in the current dataframe and append it to the list
            _dfs.append(df.rename(columns=rename_dict, errors="ignore"))
        return _dfs, known_df
    
    

    Example

    import pandas as pd
    
    
    x = pd.DataFrame(
        {
            "Large Value Column": [100, 102, 110, 98],
            "Small Value Column": [5, 12, 8, 10],
        }
    )
    y = pd.DataFrame({0: [6, 9, 11, 14], 1: [20, 40, 55, 65]})
    z = pd.DataFrame({0: [95, 80, 72, 100], 1: [0, 20, 14, 10]})
    
    pd.concat(find_similar(y, z, known_df=x)[0])
    
    # Returns:
    #
    #    Small Value Column  Large Value Column
    # 0                   6                  20
    # 1                   9                  40
    # 2                  11                  55
    # 3                  14                  65
    # 0                   0                  95
    # 1                  20                  80
    # 2                  14                  72
    # 3                  10                 100
    

    enter image description here