Search code examples
sqlduplicatescasealter

SQL: Deleting Identical Columns With Different Names


My original table ("original_table") looks like this (contains both numeric and character variables):

  age height height2 gender gender2
1  18   76.1    76.1      M       M
2  19   77.0    77.0      F       F
3  20   78.1    78.1      M       M
4  21   78.2    78.2      M       M
5  22   78.8    78.8      F       F
6  23   79.7    79.7      F       F

I would like to remove columns from this table that have identical entries, but are named differently. In the end, this should look like this ("new_table"):

  age height gender
1  18   76.1      M
2  19   77.0      F
3  20   78.1      M
4  21   78.2      M
5  22   78.8      F
6  23   79.7      F

My Question: Is there a standard way to do this in SQL? I tried to do some research and came across the following link : How do I compare two columns for equality in SQL Server?

What I Tried So Far: It seems that something like this might work:

CREATE TABLE new_table AS SELECT * FROM original_table;

ALTER TABLE new_table
ADD does_age_equal_height varchar(255);

    UPDATE new_table 

SET     does_age_equal_height =  CASE  
WHEN age = height THEN '1' ELSE '0' END AS does_age_equal_height;
                  

From here, if the "sum" of all values in the "does_age_equal_height" column equals to the number of rows from "new_table" (i.e. select count(rownum) from new_table) - this must mean that both columns are equal, and that one of the columns can be dropped.

However, this is a very inefficient method, even for tables having a small number of columns. In my example, I have 5 columns - this means that I would have to repeat the above process " 5C2" times, i.e. 5! / (2!*3!) = 10 times. For example:

ALTER TABLE employees
  ADD does_age_equal_height varchar(255),
     does_age_equal_height2 varchar(255)
does_age_equal_gender varchar(255)
does_age_equal_gender2 varchar(255)
does_height_equal_height2 varchar(255)
does_height_equal_gender varchar(255)
does_height_equal_gender2 varchar(255)
does_height2_equal_gender varchar(255)
does_height2_equal_gender2 varchar(255)
does_gender_equal_gender2 varchar(255);

This would then be followed by multiple CASE statements - further complicating the process.

Can someone please show me a more efficient way of doing this?

Thanks!


Solution

  • I hope to get your problem in the right way. This is my code in SqlServer to handle it, you should customize it based on Netezza SQL.

    My idea is:
    Calculate MD5 for each column and then compare these columns together, if there is the same hash, one of the columns will be chosen.

    I going to create the below table for this problem:

    CREATE TABLE Students
    (
        Id INT PRIMARY KEY IDENTITY,
        StudentName VARCHAR (50),
        Course VARCHAR (50),
        Score INT,
        lastName VARCHAR (50) -- another alias for StudentName ,
        metric INT, -- another alias for score
        className VARCHAR(50) -- another alias for Course 
    )
    GO
    
    INSERT INTO Students VALUES ('Sally', 'English', 95, 'Sally', 95, 'English');
    INSERT INTO Students VALUES ('Sally', 'History', 82, 'Sally', 82, 'History');
    INSERT INTO Students VALUES ('Edward', 'English', 45, 'Edward', 45, 'English');
    INSERT INTO Students VALUES ('Edward', 'History', 78, 'Edward', 78, 'History');
    

    after creating the table and inserting sample records, it turns to find similar columns.

    step 1. Declare variables.

    DECLARE @cols_q VARCHAR(max), 
            @cols VARCHAR(max), 
            @table_name  VARCHAR(max)= N'Students', 
            @res  NVARCHAR(max),
            @newCols VARCHAR(max),
            @finalResQuery VARCHAR(max);
    

    step 2. Generate dynamics query for calculating a hash for every column.

    SELECT @cols_q = COALESCE(@cols_q+ ', ','')+'HASHBYTES(''MD5'',  CONVERT(varbinary(max), (select '+ COLumn_NAME +' as t  from Students FOR XML AUTO))) as '+ COLumn_NAME,
            @cols = coalesce(@cols + ',','')+COLumn_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = @table_name; 
     set  @cols_q = 'select '+  @cols_q +' into ##tmp_'+ @table_name+' from '+ @table_name;
    

    step 3. Run generated query.

     exec(@cols_q)
    

    step 4. Get columns that duplicated columns removed.

      set @res = N'select  uniq_colname into ##temp_colnames
      from(
      select  max(colname) as uniq_colname from (
         select * from ##tmp_Students 
     )tt
     unpivot ( 
        md5_hash for colname in ( '+ @cols +')
     ) as tbl
     group by md5_hash
     )tr';
       exec ( @res);
    

    step 5. Get final results

    select @newCols  = COALESCE(@newCols+ ', ','')+ uniq_colname from ##temp_colnames
    
     set @finalResQuery = 'select '+ @newCols +' from '+ @table_name;
     exec (@finalResQuery)