Teradata - An illegally formed character string was encountered during translation


I am fetching tweets via the Twitter API into a pandas DataFrame and writing the data to a Teradata database. Unlike the other tweets, one cell holds a tweet that contains bold text. When I try to insert it into the database, the following error is raised:

OperationalError: [Version 17.0.0.4] [Session 3046127] [Teradata SQL Driver] [Error 528] A failure occurred while executing rows 1 through 292 of a batch request.
 at gosqldriver/teradatasql.(*teradataConnection).makeDriverErrorCode TeradataConnection.go:1120
 at gosqldriver/teradatasql.newTeradataRows TeradataRows.go:396
 at gosqldriver/teradatasql.(*teradataStatement).QueryContext TeradataStatement.go:122
 at gosqldriver/teradatasql.(*teradataConnection).QueryContext TeradataConnection.go:2083
 at database/sql.ctxDriverQuery ctxutil.go:48
 at database/sql.(*DB).queryDC.func1 sql.go:1579
 at database/sql.withLock sql.go:3204
 at database/sql.(*DB).queryDC sql.go:1574
 at database/sql.(*Conn).QueryContext sql.go:1823
 at main.goCreateRows goside.go:654
 at main._cgoexpwrap_cfa80c8a3acb_goCreateRows _cgo_gotypes.go:363
 at runtime.cgocallbackg1 cgocall.go:332
 at runtime.cgocallbackg cgocall.go:207
 at runtime.cgocallback_gofunc asm_amd64.s:793
 at runtime.goexit asm_amd64.s:1373
Caused by [Version 17.0.0.4] [Session 3046127] [Teradata Database] [Error 6705] An illegally formed character string was encountered during translation.
 at gosqldriver/teradatasql.(*teradataConnection).formatDatabaseError TeradataConnection.go:1138
 at gosqldriver/teradatasql.(*teradataConnection).makeChainedDatabaseError TeradataConnection.go:1154

The tweets column in the database is declared as "varchar(1000) CHARACTER SET UNICODE NOT CASESPECIFIC".

Sample data: [image not reproduced]

The tweet containing bold text is what makes the insert fail. How do I mitigate this?


Solution

  • To store or retrieve arbitrary Unicode code points, enable the Unicode Pass Through feature in both the loading session and the querying session.

    SET SESSION CHARACTER SET UNICODE PASS THROUGH ON;
    

    For the specific example given, it may help to "normalize" the Unicode text, e.g. with Python's unicodedata.normalize before loading or Teradata's TRANSLATE(...) after loading, if you want the corresponding ASCII letter characters. That would not help with other Unicode characters, such as emoji, that may also occur in the input.
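
The normalization idea can be sketched roughly as follows. This is a minimal illustration, not a tested loader: it assumes the "bold" text consists of Unicode Mathematical Alphanumeric Symbols (a common way styled text ends up in tweets), which NFKC normalization folds to plain letters. The helper names normalize_tweet and has_non_bmp are made up for this example, and checking for code points above U+FFFF is only a rough proxy for "characters that may still need Pass Through".

```python
import unicodedata


def has_non_bmp(text: str) -> bool:
    """Return True if any code point lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


def normalize_tweet(text: str) -> str:
    """Apply NFKC compatibility normalization.

    Styled letters such as MATHEMATICAL BOLD CAPITAL H (U+1D407) fold to
    their plain counterparts; characters without a compatibility mapping
    (emoji, for example) are left unchanged.
    """
    return unicodedata.normalize("NFKC", text)


# "Hello" written with mathematical bold letters (assumed sample input)
bold_hello = "\U0001D407\U0001D41E\U0001D425\U0001D425\U0001D428"
plain = normalize_tweet(bold_hello)
print(plain, has_non_bmp(bold_hello), has_non_bmp(plain))

# An emoji survives normalization, so this text still has a non-BMP character
emoji_tweet = normalize_tweet("ok \U0001F600")
print(has_non_bmp(emoji_tweet))
```

In a pandas workflow this could be applied before the insert with something like df["tweet"].map(normalize_tweet), where "tweet" is a hypothetical column name. Rows that has_non_bmp still flags afterwards are the ones for which the Pass Through setting above remains relevant.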