
U-SQL Error Extracting from TXT file


When I run my extract, I get this error:

Found invalid character-encoding for UTF-8 encoding in input. The input file may contain corrupted data, or the specified input encoding in the extractor does not match the actual file encoding. See the DETAILS section for a hexadecimal dump of the file segment containing the invalid character-encoding.

I am not able to read UTF-8 character data with the U-SQL script below.

@cgadmdomain =
    EXTRACT row_id string,
            orgarea_name string,
            last_changed_time string,
            start_date string,
            stop_date string,
            domain_name string,
            gui_description string,
            media string,
            direction string,
            distribution string,
            threshold1 string,
            threshold2 string
    FROM @cgadmdomainInPath
    USING Extractors.Text(delimiter: ';');

The file has the value "Test Kö CB" in the media column. If I remove this particular record, my script runs fine. Please let me know if I need to add anything to the extractor parameters.


Solution

  • Are you sure that the file is encoded in UTF-8 and not some other encoding? What is the byte sequence that you see for the "ö" if you open the file with a byte-level editor?

    Depending on that, you may have to set the extractor's encoding parameter to the appropriate Windows-125x encoding or to Unicode.

    If your data is, for example, encoded with Windows-1252, you can extract it with the following statement (note that we currently support only the Windows-125x encodings besides the Unicode encodings):

      @data = 
        EXTRACT ...
        FROM ... 
        USING Extractors.Csv(encoding:System.Text.Encoding.GetEncoding("Windows-1252"));
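To answer the byte-sequence question without a hex editor, the check can be sketched outside U-SQL. The following is a minimal Python sketch, not part of the original answer; the helper name `first_invalid_utf8_offset` is hypothetical, and the sample string is the value from the question. It shows that "ö" is the single byte `f6` in Windows-1252 but the two bytes `c3 b6` in UTF-8, and it reports the offset of the first byte that is not valid UTF-8.

```python
from typing import Optional


def first_invalid_utf8_offset(raw: bytes) -> Optional[int]:
    """Return the offset of the first byte that is not valid UTF-8, or None.

    Hypothetical helper for illustration only.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# The value from the question; \u00f6 is the letter o with diaeresis.
sample = "Test K\u00f6 CB"

as_utf8 = sample.encode("utf-8")      # how a UTF-8 file would store it
as_cp1252 = sample.encode("cp1252")   # how a Windows-1252 file would store it

print(as_utf8.hex(" "))     # the letter appears as the two bytes c3 b6
print(as_cp1252.hex(" "))   # the letter appears as the single byte f6

print(first_invalid_utf8_offset(as_utf8))     # None: valid UTF-8
print(first_invalid_utf8_offset(as_cp1252))   # 6: the f6 byte is not valid UTF-8
```

To apply this to the real input, read the file in binary mode (for example `open(path, "rb").read()`, where `path` is a placeholder for a local copy of the file) and pass the bytes to the helper. If it returns an offset and the byte at that offset is `f6`, the file is most likely Windows-1252 (or a similar single-byte code page) rather than UTF-8, which matches the extractor error above.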