Search code examples
sql-serverrmicrosoft-r

R scripts in SQL Server 2016 corrupted with  character


I have found a strange behaviour of SQL Server 2016 when handling broken pipes inside R scripts. See the T-SQL code below:

DECLARE 
    @r nvarchar(100)

/* Create a data frame with a broken pipe as one of its fields and a simple ASCII encoded string in another. */
SET @r = N'
df <- data.frame(
    a = "¦",
    b = "a,b,c"
    )';

/* Print @r to detect the inclusion of any unwanted characters. */
PRINT @r;

/* Execute and retrieve the output. */
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = @r,
    @output_data_1_name = N'df'
WITH RESULT SETS ((
    BadEncodingColumn varchar(2),
    GoodEncodingColumn varchar(5)
    ));

The PRINT command returns this in the Messages tab:

df <- data.frame(
    a = "¦",
    b = "a,b,c"
    )

However, the final Results tab looks like this:

BadEncodingColumn   GoodEncodingColumn
¦                  a,b,c

This behaviour seems to emerge at the EXECUTE sp_execute_external_script phase of the script, and I have seen this character (Â) when dealing with other encoding issues with Excel, R and other versions of SQL Server.

Any solutions to this behaviour? And bonus points, what is 'special' about the  character?

Edit: I have tried tinkering with data types inside SQL Server and R to no avail.


Solution

  • The issue appears to be with encoding of non-ASCII characters in the R script (broken pipe is outside the 128 ASCII characters). You can override the encoding using the ‘Encoding’ function explicitly to Unicode(UTF-8) to workaround the issue. For instance your script can be updated as follows

    DECLARE 
        @r nvarchar(100)
    
    /* Create a data frame with a broken pipe as one of its fields and a simple ASCII encoded string in another. */
    SET @r = N'
    df <- data.frame(
        a = "¦",
        b = "a,b,c"
        )
    
    Encoding(levels(df$a)) <- "UTF-8" ###### Encoding override'
    
    /* Print @r to detect the inclusion of any unwanted characters. */
    PRINT @r;
    
    /* Execute and retrieve the output. */
    EXECUTE sp_execute_external_script
        @language = N'R',
        @script = @r,
        @output_data_1_name = N'df'
    WITH RESULT SETS ((
        BadEncodingColumn varchar(2),
        GoodEncodingColumn varchar(5)
        ));
    

    Produces the following results

    BadEncodingColumn   GoodEncodingColumn
    ¦                   a,b,c