Search code examples
azureazure-data-lakeu-sql

Removing empty lines in a string using U-SQL


I get reviews from customers into azure data lake which are combined/formatted in U-SQL. Some of the customers leave a space in between their comment lines. Any idea how to remove these blank lines using U-Sql?

E.g of a single comment

My name is abc
<blank line>
I love playing football. 

Need to remove the blank line in between. Trim() removes the start and end whitespaces only. Thanks.


Solution

  • You can use RegEx and U-SQL to clean up your text, for example, this simple script replaces two carriage returns with a single carriage return, removing blank lines:

    @input = 
        SELECT * FROM
            ( VALUES
                ( 1, @"My name is abc
    
    I love playing football." ),
                ( 2, @"I love U-SQL
    
    I'm indifferent to Hadoop." )
            ) AS x( id, review );
    
    
    
    // Strip out repeated carriage returns
    @output =
        SELECT id, 
                Regex.Replace(review, "(\r\n){2}", "\r\n", RegexOptions.Multiline) AS cleanedReview
        FROM @input;
    
    
    OUTPUT @output
    TO "/output/output.csv"
    USING Outputters.Csv();
    

    You may have to experiment with the RegEx expression to clean your particular data.