java, rtf

How do I extract inline files in RTF format into a list of files?


I am reading data from a SQL Server database in which one column stores its content in RTF format. Using RTF Parser Kit, I have managed to convert the RTF to plain text:

InputStream is = new ByteArrayInputStream(input.getBytes());
StringTextConverter converter = new StringTextConverter();
converter.convert(new RtfStreamSource(is));
input = converter.getText();

However, some inputs contain inline images. Is there a way to extract all of these images into a list of files, e.g. ArrayList<File> files = new ArrayList<>();, using RTF Parser Kit?

For example, a document containing two short lines of text followed by an inline image is stored like this:

{\rtf1\ansi\ansicpg1253\uc1\deff1\adeff1\deflang0\deflangfe0\adeflang0{\fonttbl
{\f0\fswiss\fcharset0\fprq2{\*\panose 020B0604020202020204}Arial;}
{\f1\fswiss\fcharset161\fprq2 Arial Greek;}
{\f2\froman\fcharset2\fprq2{\*\panose 05050102010706020507}Symbol;}}
{\colortbl;\red0\green0\blue0;}
{\stylesheet{\s0\ltrpar\itap0\nowidctlpar\ql\li0\ri0\lin0\rin0\cbpat0\rtlch\af1\afs24\ltrch\f1\fs24 [Normal];}{\*\cs10\additive Default Paragraph Font;}}
{\info
{\*\txInfo{\txVer 24.0.712.500}}}
\paperw12240\paperh15840\margl1138\margt1138\margr1138\margb1138\deftab1134\widowctrl\lytexcttp\formshade\sectd
\headery567\footery567\pgwsxn12240\pghsxn15840\marglsxn1138\margtsxn1138\margrsxn1138\margbsxn1138\pgbrdropt32\pard\ltrpar\itap0\nowidctlpar\ql\li0\ri0\lin0\rin0\plain\rtlch\af0\afs20\alang1033\ltrch\f0\fs20\lang1033\langnp1033\langfe1033\langfenp1033 Hello, \par\par This is an example!\par\par
{\shp{\*\shpinst\shpleft0\shptop0\shpright7500\shpbottom3450\shpfhdr0\shpbxcolumn\shpbxignore\shpbypara\shpbyignore\shpwr2\shpwrk0\shpz0\shplid1025{\sp{\sn shapeType}{\sv 75}}{\sp{\sn fFlipH}{\sv 0}}{\sp{\sn fFlipV}{\sv 0}}{\sp{\sn wzName}{\sv _tx_id_1_}}{\sp{\sn pib}{\sv {\pict\jpegblip\picw13229\pich6085\picwgoal7500\pichgoal3450\picscalex100\picscaley100 
ffd8ffe000104a46494600010101006000600000ffe101024578696600004d4d002a000000080007011a0005000000010000
0062011b0005000000010000006a012800030000000100020000013100020000001100000072013b00020000000700000084
ff0011dbd85aebb65a6ead089ededb55b7d2f51d5b4f8af6253b654b4d4af61071b6e24fbc403d62800a002800a002800a00
2800a002800a002800a002800a002800a002800a002800a002800a002800a002800a002800a002800a002800a002800a0028
00a002800a002800a002800a002800a002800a002803ffd9}}}\par }

The example above includes only a small fragment of the image's hex data, because the full representation was too large to fit here.


Solution

  • I've added some sample code to the RTF Parser Kit repository: an ImageListener class, and a simple "driver" class that shows how it is used.

    The code takes a naive approach: it looks for the \pict command, expects it to be the first command in a group, and then adds any further commands in that group, together with their parameters, to a Map (including the image data as a string of hex digits). When it reaches the end of the group, it calls a method you provide, passing the populated Map.

    Unfortunately, this is where the hard work starts: you'll need to interpret the keys in the Map to determine the image type and, based on that, work out how to convert the hex data into an image you can work with.
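As a rough illustration of that last step, here is a minimal sketch that uses only the Java standard library. It assumes the Map has String keys named after the RTF control words found in the \pict group (for example jpegblip or pngblip) and that the hex data is stored under a key called "data". Both the key name "data" and the class PictDecoder are hypothetical, so check the real ImageListener code for the names it actually uses. The sketch only handles picture types whose bytes can be written to disk unchanged.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical helper, not part of RTF Parser Kit.
class PictDecoder {

    // Assumed key under which the listener stores the hex image data.
    static final String DATA_KEY = "data";

    /** Maps the picture-type control word in the group to a file extension, or null if unsupported. */
    static String extensionFor(Map<String, String> pict) {
        if (pict.containsKey("jpegblip")) return "jpg";
        if (pict.containsKey("pngblip")) return "png";
        if (pict.containsKey("emfblip")) return "emf";
        return null; // other types (e.g. \wmetafile, \dibitmap) are not handled in this sketch
    }

    /** Converts a string of hex digits to bytes, ignoring whitespace and line breaks. */
    static byte[] hexToBytes(String hex) {
        String clean = hex.replaceAll("\\s+", "");
        if (clean.length() % 2 != 0) {
            throw new IllegalArgumentException("Odd number of hex digits");
        }
        byte[] out = new byte[clean.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(clean.charAt(2 * i), 16);
            int lo = Character.digit(clean.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("Invalid hex digit near position " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    /** Writes one picture into dir and returns the file, or null if the type is not recognised. */
    static File writeImage(Map<String, String> pict, Path dir, int index) {
        String ext = extensionFor(pict);
        if (ext == null || !pict.containsKey(DATA_KEY)) {
            return null;
        }
        Path target = dir.resolve("image" + index + "." + ext);
        try {
            Files.write(target, hexToBytes(pict.get(DATA_KEY)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target.toFile();
    }
}
```

In the method you pass to the listener, you could call PictDecoder.writeImage(map, outputDir, files.size()) and add any non-null result to your ArrayList<File>. JPEG and PNG data can be written as-is, which covers the \jpegblip picture in the question; the metafile and bitmap picture types would likely need more work before they are usable as ordinary image files.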