Using U-SQL to eliminate duplicate and null values in one specific column while keeping a 2nd column properly aligned

I am trying to use U-SQL to remove duplicate, null,'',and Nan cells in a specific column called "Function" of a csv file. I also want to keep the Product column correctly aligned with the Function column after the blank rows are removed. So i would want to remove the same rows in the Product column as I do in the Function column to keep them properly aligned. I want to only keep one occurrence of a duplicate Function row. In this case I only want to keep the very first occurrence. The Product column has no empty cells and has all unique values. Any help is greatly appreciated. I know this can be done in a much easier way, but I want to use the code to automate the process as the Data in the DataLake changes over time. I think I am somewhat close in the code i currently have. The actual data set is a very large file and I am fairly certain that there are at least 4 duplicate values in the Functions column that aren't simply empty cells. I need to eliminate both duplicate values and empty cells in the Function column because empty cells are being recognized as duplicates as well. I want to be able to use the Function values as a primary key in the next step of my school project that wont include the Product column.

DECLARE @inputfile string = "/input/Function.csv";
//DECLARE @OutputUserFile string = "/output/Test_Function/UniqueFunction.csv";

@RawData =
    EXTRACT Function string,
            Product string

    FROM @inputfile
    USING Extractors.Csv(encoding: Encoding.[ASCII]);

// Query from Function data
// Set ROW_NUMBER() of each row within the window partitioned by Function field
@RawDataDuplicates=
    SELECT ROW_NUMBER() OVER (PARTITION BY Function) AS RowNum, Function AS function
    FROM @RawData;

// ORDER BY Function to see duplicate rows next to one another
@RawDataDuplicates2=
    SELECT *
    FROM @RawDataDuplicates
    ORDER BY function
    OFFSET 0 ROWS;

// Write to File
//OUTPUT @RawDataDuplicates2
//TO "/output/Test_Function/FunctionOver-Dups.csv"
//USING Outputters.Csv();

// GROUP BY and count # of duplicates per Function
@groupBy = SELECT Function, COUNT(Function) AS FunctionCount
FROM @RawData
GROUP BY Function
ORDER BY Function
OFFSET 0 ROWS;

// Write to file 
//OUTPUT @groupBy
//TO "/output/Test_Function/FunctionGroupBy-Dups.csv"
//USING Outputters.Csv();

@RawDataDuplicates3 =
    SELECT * 
    FROM @RawDataDuplicates2
    WHERE RowNum == 1;

OUTPUT @RawDataDuplicates3
TO "/output/Test_Function/FunctionUniqueEmail.csv"
USING Outputters.Csv(outputHeader: true);    

//OUTPUT @RawData
//TO @OutputUserFile
//USING Outputters.Csv(outputHeader: true);

I have also commented out some code that I don't necessarily need. When I run the code as it is, I am currently getting this error: this E_CSC_USER_REDUNDANTSTATEMENTINSCRIPT, Error Message: This statement is dead code.. – It does not give a line number but likely the "Function AS function" line?

Here is a sample file that is a small slice of the full spreadsheet and only includes data in the 2 relevant columns. The full spreadsheet has data in all columns. https://www.dropbox.com/s/auu2aco4b037xn7/Function.csv?dl=0

here is a screenshot of the output I get when I follow wBob's advice and click.

Solution

You can apply a series of transformations to your data using string functions like .Length and ranking function like ROW_NUMBER to remove the records you want, for example:

@input = 
    EXTRACT 
        CompanyID             string,
        division              string,
        store_location        string,
        International_Id      string,
        Function              string,
        office_location       string,
        address               string,
        Product               string,
        Revenue               string,
        sales_goal            string,
        Manager               string,
        Country               string

    FROM "/input/input142.csv"
    USING Extractors.Csv(skipFirstNRows : 1 );


// Remove empty columns
@working =
    SELECT *
    FROM @input
    WHERE Function.Length > 0;


// Rank the columns by Function and keep only the first one
@working =
    SELECT CompanyID,
           division,
           store_location,
           International_Id,
           Function,
           office_location,
           address,
           Product,
           Revenue,
           sales_goal,
           Manager,
           Country
    FROM
    (
        SELECT *,
               ROW_NUMBER() OVER(PARTITION BY Function ORDER BY Product) AS rn
        FROM @working
    ) AS x
    WHERE rn == 1;


@output = SELECT * FROM @working;

OUTPUT @output TO "/output/output.csv"
USING Outputters.Csv(quoting:false);

My results: