
Teradata SQL Performance Tuning: STRTOK_SPLIT_TO_TABLE: EXPLAIN Failed. 3738: String is longer than 31000 characters


I am again probably hoping against hope, but when I use STRTOK_SPLIT_TO_TABLE on an in-list that is longer than 32K characters, it gives me this error:

EXPLAIN Failed. 3738:  String is longer than 31000 characters

What I did

Split the string and used two STRTOK_SPLIT_TO_TABLE calls, creating two derived tables, and then joined them to my main table.

The actual error translation in the Teradata error documentation advises using "USING followed by data parcel". As I understand it, you can't do any of that inside a UDF call. So, other than splitting the string, is there any other approach? I have ~400K values in the in-list, and I want to avoid loading them into a volatile table because that seems to have its own overhead cost. That is why I am using this function, but it has never worked.


    EXPLAIN
    SELECT  COL1
    FROM    DB.TB1 ,
            (
            SELECT  TOKEN2
            FROM    TABLE ( STRTOK_SPLIT_TO_TABLE( 1 ,
                        '123456,123456,43456,......20K Long string' , ',' )
                    RETURNS ( OUTKEY INT , TOKENNUM INT ,
                        TOKEN2 VARCHAR ( 20 ) CHARACTER SET UNICODE ) ) DD
            ) D2 ,
            (
            SELECT  TOKEN1
            FROM    TABLE ( STRTOK_SPLIT_TO_TABLE( 1 ,
                        '123456,123456,43456.....20K long string' , ',' )
                    RETURNS ( OUTKEY INT , TOKENNUM INT ,
                        TOKEN1 VARCHAR ( 20 ) CHARACTER SET UNICODE ) ) DD
            ) D1
    WHERE   SUBSTR ( TB1.COL2 , 21 , 9 ) = D2.TOKEN2
        OR  SUBSTR ( TB1.COL2 , 21 , 9 ) = D1.TOKEN1
    GROUP BY 1 ;


/* the same logic can be re-written
   with a UNION or a WHERE EXISTS,
   but the focus here is on that UDF */

So, to summarize:

  • I have to split the string into even smaller pieces; two is not enough. How do you harness this UDF's features when the string is big enough to be worth splitting in the first place? Even with the smaller strings, my EXPLAIN runs fine, but the UDF itself throws this error:

    Query Failed. 9134:  STRTOK: Input string exceeded the max. allowed string   length. 
    
  • It will NOT let me use LATIN. My tables are LATIN, so the implicit translate adds character-set conversion overhead on the join. Is there a workaround here? (Maybe translate the UDF's output column yourself so the optimizer doesn't do it on the join column. But that probably wouldn't improve things much, since the optimizer would choose to translate the side with fewer rows anyway.)
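For what it's worth, here is a sketch of that explicit translation, using Teradata's TRANSLATE function on the token column (table and column names as in the query above; it assumes the UNICODE_TO_LATIN translation covers the token data):

    SELECT  COL1
    FROM    DB.TB1 ,
            (
            SELECT  TRANSLATE ( TOKEN2 USING UNICODE_TO_LATIN ) AS TOKEN2_L
            FROM    TABLE ( STRTOK_SPLIT_TO_TABLE( 1 ,
                        '123456,123456,43456,......20K Long string' , ',' )
                    RETURNS ( OUTKEY INT , TOKENNUM INT ,
                        TOKEN2 VARCHAR ( 20 ) CHARACTER SET UNICODE ) ) DD
            ) D2
    WHERE   SUBSTR ( TB1.COL2 , 21 , 9 ) = D2.TOKEN2_L
    GROUP BY 1 ;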

BTW: I am on 14.10, and per the documentation below:

    instring
    A character or CLOB argument.
    If instring is NULL, STRTOK_SPLIT_TO_TABLE returns NULL.

So, per the documentation, I can pass a CLOB as instring. Yet even AFTER my > 32K string is cast as CLOB, it throws

EXPLAIN Failed. 3738:  String is longer than 31000 characters

Finally, running CHARACTER_LENGTH on the strings in D1 and D2 gives about 19K each. The documentation says instring has a VARCHAR(32000) limit, so why the rejection?

Update: I could cast D1 and D2 as CLOBs, stay with the two-string approach, and sneak past the 9134 error. But two caveats still exist:

  • this UDF only likes UNICODE for the token. I added a translation on the token, for whatever it's worth.
  • I don't quite follow how the 32K logic works here, because when I test it with

    sel Character_length ('string here' )

    it returns a number < 32K, so my string should not be subject to the 32K cap.

So using this in the smaller component of the query works fine: I can convert the string list to tables, join them, and pull out the ID list I was interested in. But when I plug that component into the bigger report, it freaks out with "Segmentation fault in Local AMP, don't re-submit", so I had to adapt another approach.
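One fallback sketch, if the inline table operator destabilizes the bigger report: materialize the tokens into a volatile table once, then join the report query to that instead. (This reintroduces the VT overhead I wanted to avoid, and the table and index names here are illustrative.)

    CREATE VOLATILE TABLE VT_TOKENS AS
    (
    SELECT  TOKEN2
    FROM    TABLE ( STRTOK_SPLIT_TO_TABLE( 1 ,
                '123456,123456,43456,......20K Long string' ( CLOB ) , ',' )
            RETURNS ( OUTKEY INT , TOKENNUM INT ,
                TOKEN2 VARCHAR ( 20 ) CHARACTER SET UNICODE ) ) DD
    ) WITH DATA
    PRIMARY INDEX ( TOKEN2 )
    ON COMMIT PRESERVE ROWS ;

    SELECT  COL1
    FROM    DB.TB1 , VT_TOKENS VT
    WHERE   SUBSTR ( TB1.COL2 , 21 , 9 ) = VT.TOKEN2
    GROUP BY 1 ;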


Solution

  • This is probably a bug in the Teradata UDF: even though the instring length is < 32K, it rejects it.

    sel char_length ('string 32K or less')   /* output: 31890 */
    

    When I use it here

    SELECT  COL1 
    FROM    DB.TB1 , ( 
    SELECT  TOKEN2 
    FROM    TABLE ( STRTOK_SPLIT_TO_TABLE( 1 , '123456,123456,43456,......32K Long string' ,
    ','  ) 
    RETURNS ( OUTKEY INT , TOKENNUM INT , TOKEN2 VARCHAR ( 20 ) CHARACTER 
    SET unicode ) ) DD ) D2 
    

    It will fail, and the failure point is right around half of 32K: it fails for anything longer than about 16K characters. But using CLOB makes it work. Other than a bug, I can't offer another explanation.

    SELECT  COL1 
    FROM    DB.TB1 , ( 
    SELECT  TOKEN2 
    FROM    TABLE ( STRTOK_SPLIT_TO_TABLE( 1 , '123456,123456,43456,......32K Long string' ( CLOB)  ,
    ','  ) 
    RETURNS ( OUTKEY INT , TOKENNUM INT , TOKEN2 VARCHAR ( 20 ) CHARACTER 
    SET unicode ) ) DD ) D2