Search code examples
sqloracle-databaseblobhtml-parsing

Oracle remove html from clob fields


I have a simple function to convert html blob to plain text

 FUNCTION HTML_TO_TEXT(html IN CLOB) RETURN CLOB
 IS v_return CLOB;
   BEGIN     
select utl_i18n.unescape_reference(regexp_replace(html, '<.+?>', ' ')) INTO v_return from dual;
return (v_return);
    END;

called in that way:

SELECT A, B, C, HTML_TO_TEXT(BLobField) FROM t1

all works fine until BlobFields contains more than 4000 character, then i got

ORA-01704: string literal too long
01704. 00000 -  "string literal too long"
*Cause:    The string literal is longer than 4000 characters.
*Action:   Use a string literal of at most 4000 characters.
           Longer values may only be entered using bind variables.

i try to avoud string inside function using variables but nothing changes:

FUNCTION HTML_TO_TEXT(html IN CLOB) RETURN CLOB
 IS v_return CLOB;
 "stringa" CLOB;
   BEGIN 
    SELECT regexp_replace(html, '<.+?>', ' ') INTO "stringa" FROM DUAL;
    select utl_i18n.unescape_reference("stringa") INTO v_return from dual;
    return (v_return);
END;

Solution

  • Do not use regular expressions to parse HTML. If you want to extract the text then use an XML parser:

    SELECT a,
           b,
           c,
           UTL_I18N.UNESCAPE_REFERENCE(
             XMLQUERY(
               '//text()'
               PASSING XMLTYPE(blobfield, 1)
               RETURNING CONTENT
             ).getStringVal()
           ) AS text
    FROM   t1
    

    Which will work where the extracted text is 4000 characters or less (since XMLTYPE.getStringVal() will return a VARCHAR2 data type and UTL_I18N.UNESCAPE_REFERENCE accepts a VARCHAR2 argument).


    If you want to get it to work on CLOB values then you can still use XMLQUERY and getClobVal() but UTL_I18N.UNESCAPE_REFERENCE still only works on VARCHAR2 input (and not CLOBs) so you will need to split the CLOB into segments and parse those and concatenate them once you are done.

    Something like:

    CREATE FUNCTION html_to_text(
      i_xml IN XMLTYPE
    ) RETURN CLOB
    IS
      v_text   CLOB;
      v_output CLOB;
      str      VARCHAR2(4000);
      len      PLS_INTEGER;
      pos      PLS_INTEGER := 1;
      lim      CONSTANT PLS_INTEGER := 4000;
    BEGIN
      SELECT XMLQUERY(
               '//text()'
               PASSING i_xml
               RETURNING CONTENT
             ).getStringVal()
      INTO   v_text
      FROM   DUAL;
      
      len := LENGTH(v_text);
      WHILE pos <= len LOOP
        str := DBMS_LOB.SUBSTR(v_text, lim, pos);
        v_output := v_output || UTL_I18N.UNESCAPE_REFERENCE(str);
        pos := pos + lim;
      END LOOP;
      
      RETURN v_output;
    END;
    /
    

    However, you probably want to make it more robust and check if you are going to split the string in the middle of an escaped XML character.

    db<>fiddle here