Search code examples
azureazure-data-lakeu-sql

Using U-SQL to eliminate duplicate and null values in one specific column while keeping a 2nd column properly aligned


I am trying to use U-SQL to remove duplicate, null,'',and Nan cells in a specific column called "Function" of a csv file. I also want to keep the Product column correctly aligned with the Function column after the blank rows are removed. So i would want to remove the same rows in the Product column as I do in the Function column to keep them properly aligned. I want to only keep one occurrence of a duplicate Function row. In this case I only want to keep the very first occurrence. The Product column has no empty cells and has all unique values. Any help is greatly appreciated. I know this can be done in a much easier way, but I want to use the code to automate the process as the Data in the DataLake changes over time. I think I am somewhat close in the code i currently have. The actual data set is a very large file and I am fairly certain that there are at least 4 duplicate values in the Functions column that aren't simply empty cells. I need to eliminate both duplicate values and empty cells in the Function column because empty cells are being recognized as duplicates as well. I want to be able to use the Function values as a primary key in the next step of my school project that wont include the Product column.

DECLARE @inputfile string = "/input/Function.csv";
//DECLARE @OutputUserFile string = "/output/Test_Function/UniqueFunction.csv";

@RawData =
    EXTRACT Function string,
            Product string

    FROM @inputfile
    USING Extractors.Csv(encoding: Encoding.[ASCII]);

// Query from Function data
// Set ROW_NUMBER() of each row within the window partitioned by Function field
@RawDataDuplicates=
    SELECT ROW_NUMBER() OVER (PARTITION BY Function) AS RowNum, Function AS function
    FROM @RawData;

// ORDER BY Function to see duplicate rows next to one another
@RawDataDuplicates2=
    SELECT *
    FROM @RawDataDuplicates
    ORDER BY function
    OFFSET 0 ROWS;

// Write to File
//OUTPUT @RawDataDuplicates2
//TO "/output/Test_Function/FunctionOver-Dups.csv"
//USING Outputters.Csv();

// GROUP BY and count # of duplicates per Function
@groupBy = SELECT Function, COUNT(Function) AS FunctionCount
FROM @RawData
GROUP BY Function
ORDER BY Function
OFFSET 0 ROWS;

// Write to file 
//OUTPUT @groupBy
//TO "/output/Test_Function/FunctionGroupBy-Dups.csv"
//USING Outputters.Csv();

@RawDataDuplicates3 =
    SELECT * 
    FROM @RawDataDuplicates2
    WHERE RowNum == 1;

OUTPUT @RawDataDuplicates3
TO "/output/Test_Function/FunctionUniqueEmail.csv"
USING Outputters.Csv(outputHeader: true);    

//OUTPUT @RawData
//TO @OutputUserFile
//USING Outputters.Csv(outputHeader: true);

I have also commented out some code that I don't necessarily need. When I run the code as it is, I am currently getting this error: this E_CSC_USER_REDUNDANTSTATEMENTINSCRIPT, Error Message: This statement is dead code.. – It does not give a line number but likely the "Function AS function" line?

Here is a sample file that is a small slice of the full spreadsheet and only includes data in the 2 relevant columns. The full spreadsheet has data in all columns. https://www.dropbox.com/s/auu2aco4b037xn7/Function.csv?dl=0

here is a screenshot of the output I get when I follow wBob's advice and click. enter image description here


Solution

  • You can apply a series of transformations to your data using string functions like .Length and ranking function like ROW_NUMBER to remove the records you want, for example:

    @input = 
        EXTRACT 
            CompanyID             string,
            division              string,
            store_location        string,
            International_Id      string,
            Function              string,
            office_location       string,
            address               string,
            Product               string,
            Revenue               string,
            sales_goal            string,
            Manager               string,
            Country               string
    
        FROM "/input/input142.csv"
        USING Extractors.Csv(skipFirstNRows : 1 );
    
    
    // Remove empty columns
    @working =
        SELECT *
        FROM @input
        WHERE Function.Length > 0;
    
    
    // Rank the columns by Function and keep only the first one
    @working =
        SELECT CompanyID,
               division,
               store_location,
               International_Id,
               Function,
               office_location,
               address,
               Product,
               Revenue,
               sales_goal,
               Manager,
               Country
        FROM
        (
            SELECT *,
                   ROW_NUMBER() OVER(PARTITION BY Function ORDER BY Product) AS rn
            FROM @working
        ) AS x
        WHERE rn == 1;
    
    
    @output = SELECT * FROM @working;
    
    OUTPUT @output TO "/output/output.csv"
    USING Outputters.Csv(quoting:false);
    

    My results: Results