U-SQL Error Extracting from TXT file

When running my extract, I got this error:

Found invalid character-encoding for UTF-8 encoding in input. The input file may contain corrupted data, or the specified input encoding in the extractor does not match the actual file encoding. See the DETAILS section for a hexadecimal dump of the file segment containing the invalid character-encoding.

I am not able to read UTF-8 character data with the U-SQL script below.

  @cgadmdomain =
    EXTRACT
      row_id string,
      orgarea_name string,
      last_changed_time string,
      start_date string,
      stop_date string,
      domain_name string,
      gui_description string,
      media string,
      direction string,
      distribution string,
      threshold1 string,
      threshold2 string
    FROM @cgadmdomainInPath
    USING Extractors.Text(delimiter: ';');

The file has the value "Test Kö CB" in the media column. If I remove this particular record, my script runs fine. Please let me know if I need to add anything to the extractor parameters.

Solution

Are you sure that the file is encoded in UTF-8 and not some other encoding? What byte sequence do you see for the "ö" if you open the file with a byte-level editor?

Depending on that, you may have to set the extractor's encoding parameter to the appropriate Windows-125x encoding or to Unicode.
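If you do not have a byte-level editor at hand, a short script can do the same check. The following Python sketch is only an illustration and is not part of U-SQL; the helper name find_invalid_utf8 is made up for this example. It encodes the value from the question both ways and reports the first byte that is not valid UTF-8. In a Windows-1252 file the "ö" is the single byte f6, which cannot occur on its own in UTF-8, whereas in a UTF-8 file it is the two bytes c3 b6.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for the first invalid UTF-8
    sequence in data, or None if the data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start:err.end]
    return None


def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


# The value from the question, encoded both ways for comparison.
sample = "Test Kö CB"
as_utf8 = sample.encode("utf-8")
as_cp1252 = sample.encode("cp1252")  # Python's name for Windows-1252

print(hex_dump(as_utf8))             # 54 65 73 74 20 4b c3 b6 20 43 42
print(hex_dump(as_cp1252))           # 54 65 73 74 20 4b f6 20 43 42
print(find_invalid_utf8(as_utf8))    # None
print(find_invalid_utf8(as_cp1252))  # (6, b'\xf6')
```

To check a real file, read it in binary mode, for example open(path, "rb").read() with your own path, and pass the bytes to find_invalid_utf8. A lone f6 at the reported offset would point to a Windows-125x encoding rather than UTF-8.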

If your data is encoded with Windows-1252, for example, you can extract it with the following statement (note that we currently support only the Windows-125x encodings in addition to the Unicode encodings):

  @data = 
    EXTRACT ...
    FROM ... 
    USING Extractors.Csv(encoding:System.Text.Encoding.GetEncoding("Windows-1252"));