Search code examples
pythondataframestring-comparison

Comparing multiple columns for a single row


I am grouping columns and identifying rows that have different values for each group. For example: I can group columns A,B,C,D and delete column A because it is different (Row 2 is 2.1). Also, I can group columns E,F,G,H and delete column G because Row 1 (Row 0 is Blue).

      A |  B |  C   |   D    |   E   |   F   |   G  |   H
  | ---------------------------------------------------------|
0 | 1.0 |  1 | 1 in | 1 inch | Red   |  Red  | Blue |  Red
  | ---------------------------------------------------------|
1 | 2.0 |  2 | 2 in | 2 inch | Green | Green | Green| Green
  | ---------------------------------------------------------|
2 | 2.1 |  2 | 2 in | 2 inch | Blue  |  Blue | Blue |  Blue

What I have tried so far to compare values:

import difflib
text1 = '1.0'
text2 = '1 in'
text3 = '1 inch'
output = str(int(difflib.SequenceMatcher(None, text1, text2, text3).ratio()*100))

output: '28'

This does not work well to compare numbers followed by a measurement like inches or mm. I then tried spacy.load('en_core_web_sm') and that works better but its still not there yet. Are there any ways to compare a group of values that are similar to 1.0, 1, 1 in, 1 inch?


Solution

  • For columns with only strings, you can use pandas df.equals() that compares two dataframes or series (cols)

    #Example    
    df.E.equals(df.F)
    

    You can use this function to compare many columns to a single one I called main or template, which should be the column where you have the "correct" values.

    def col_compare(main_col, *to_compare):
      '''Compares each column from a list to another column
      Inputs: 
      * main_col: enter the column name (e.g. 'A')
      * to_compare: enter as many column names as you want (e.g. 'B', 'C') '''
      # Columns to compare to list
      to_compare = list(to_compare)
      # List to store results
      results = []
    
      # Compare columns from the list with the template column
      for col in to_compare:
        if not df[main_col].equals(df[col]):
          results.append(col)
      
      print(f'Main Column: {main_col}')
      print(f'Compared to: {to_compare}')
      return f"The columns that have different values from {main_col} are {results}"
    

    e.g

    `col_compare('E', 'F', 'G', 'H')`
    
    output:
    Main Column: E
    Compared to: ['F', 'G', 'H']
    The columns that have different values from E are ['G']
    

    For the columns A, B, C and D, where you have numbers you want to compare, but pieces of strings after that, one option is to extract the numbers into new columns just for comparison and you can drop them later. You can create new columns with the code below for each column with numbers and strings:

    df['C_num'] = df.C.apply( lambda x: int(re.search('[0-9]*', x).group() ) )
    

    and then use the function col_compare above to run the comparison between the numeric columns.