I am trying to use U-SQL to remove duplicate, null,'',and Nan cells in a specific column called "Function" of a csv file. I also want to keep the Product column correctly aligned with the Function column after the blank rows are removed. So i would want to remove the same rows in the Product column as I do in the Function column to keep them properly aligned. I want to only keep one occurrence of a duplicate Function row. In this case I only want to keep the very first occurrence. The Product column has no empty cells and has all unique values. Any help is greatly appreciated. I know this can be done in a much easier way, but I want to use the code to automate the process as the Data in the DataLake changes over time. I think I am somewhat close in the code i currently have. The actual data set is a very large file and I am fairly certain that there are at least 4 duplicate values in the Functions column that aren't simply empty cells. I need to eliminate both duplicate values and empty cells in the Function column because empty cells are being recognized as duplicates as well. I want to be able to use the Function values as a primary key in the next step of my school project that wont include the Product column.
DECLARE @inputfile string = "/input/Function.csv";
//DECLARE @OutputUserFile string = "/output/Test_Function/UniqueFunction.csv";
@RawData =
EXTRACT Function string,
Product string
FROM @inputfile
USING Extractors.Csv(encoding: Encoding.[ASCII]);
// Query from Function data
// Set ROW_NUMBER() of each row within the window partitioned by Function field
@RawDataDuplicates=
SELECT ROW_NUMBER() OVER (PARTITION BY Function) AS RowNum, Function AS function
FROM @RawData;
// ORDER BY Function to see duplicate rows next to one another
@RawDataDuplicates2=
SELECT *
FROM @RawDataDuplicates
ORDER BY function
OFFSET 0 ROWS;
// Write to File
//OUTPUT @RawDataDuplicates2
//TO "/output/Test_Function/FunctionOver-Dups.csv"
//USING Outputters.Csv();
// GROUP BY and count # of duplicates per Function
@groupBy = SELECT Function, COUNT(Function) AS FunctionCount
FROM @RawData
GROUP BY Function
ORDER BY Function
OFFSET 0 ROWS;
// Write to file
//OUTPUT @groupBy
//TO "/output/Test_Function/FunctionGroupBy-Dups.csv"
//USING Outputters.Csv();
@RawDataDuplicates3 =
SELECT *
FROM @RawDataDuplicates2
WHERE RowNum == 1;
OUTPUT @RawDataDuplicates3
TO "/output/Test_Function/FunctionUniqueEmail.csv"
USING Outputters.Csv(outputHeader: true);
//OUTPUT @RawData
//TO @OutputUserFile
//USING Outputters.Csv(outputHeader: true);
I have also commented out some code that I don't necessarily need. When I run the code as it is, I am currently getting this error: this E_CSC_USER_REDUNDANTSTATEMENTINSCRIPT, Error Message: This statement is dead code.. – It does not give a line number but likely the "Function AS function" line?
Here is a sample file that is a small slice of the full spreadsheet and only includes data in the 2 relevant columns. The full spreadsheet has data in all columns. https://www.dropbox.com/s/auu2aco4b037xn7/Function.csv?dl=0
here is a screenshot of the output I get when I follow wBob's advice and click.
You can apply a series of transformations to your data using string functions like .Length
and ranking function like ROW_NUMBER
to remove the records you want, for example:
@input =
EXTRACT
CompanyID string,
division string,
store_location string,
International_Id string,
Function string,
office_location string,
address string,
Product string,
Revenue string,
sales_goal string,
Manager string,
Country string
FROM "/input/input142.csv"
USING Extractors.Csv(skipFirstNRows : 1 );
// Remove empty columns
@working =
SELECT *
FROM @input
WHERE Function.Length > 0;
// Rank the columns by Function and keep only the first one
@working =
SELECT CompanyID,
division,
store_location,
International_Id,
Function,
office_location,
address,
Product,
Revenue,
sales_goal,
Manager,
Country
FROM
(
SELECT *,
ROW_NUMBER() OVER(PARTITION BY Function ORDER BY Product) AS rn
FROM @working
) AS x
WHERE rn == 1;
@output = SELECT * FROM @working;
OUTPUT @output TO "/output/output.csv"
USING Outputters.Csv(quoting:false);