How to extract a Teradata .TPT file with UTF-8 encoding

We are currently extracting several Teradata .TPT files that we will upload to AWS S3, however the files are coming with ANSI encode

I need them to come with encode UTF-8

Solution

You must specify the character set in your TPT script. At the top add:

USING CHARACTER SET UTF8

The tricky part is that UTF8 here has 3 bytes per character, so in your DEFINE SCHEMA you must triple the size of each field.

For example if your schema looks like:

   DEFINE SCHEMA s_some_export
    (
            status VARCHAR(20),
            userid VARCHAR(20),
            firstname VARCHAR(64),
    );

You'll have to triple the values to accommodate your UTF8 characters:

   DEFINE SCHEMA s_some_export
    (
            status VARCHAR(60),
            userid VARCHAR(60),
            firstname VARCHAR(192),
    );

Sometimes, because I'm lazy, I define my TPT with USING CHARACTER SET UTF16 so that I only need double each field size (the math is easier). BUT it means I have to convert it to UTF8 after extraction. In Linux this would just be iconv -f UTF-16LE -t UTF-8 myoutputfile.csv > myoutputfile.utf8.csv

Some caveats:

If your table's field is defined as CHAR and CHARACTER SET LATIN then you may run into column size issues with your schema. see here
Dates and Timestamps can get wierd as they don't need to be doubled so defining them as VARCHAR in your schema can get you into trouble. You may have to fuss around a bit here. My suggestion would be to change the view from which you are selecting the data for you TPT and CAST(yourdate AS VARCHAR(10)) as yourdate and then use VARCHAR(30) in your schema so you don't have to think about the field types while defining your schema. This means extra CPU overhead in your extraction, but unless you are running tight on resources I think it's worth it. I'm also very lazy that way and always happy to just get the damned TPT to extract data without much debugging.