Tags: java, parsing, field, data-extraction, odt

Extract fields from ODT document using Java library


I need a Java library (or code) to extract field tags from the content of an ODT document. I know an ODT file is a zipped archive that keeps its contents in a content.xml file. Of course I could just extract the files, open content.xml and parse it, but I believe some higher-level code exists. As an example, the content looks like this:

<text:p text:style-name="Standard">Hi ${name}!</text:p>    
<text:p text:style-name="Standard">
<text:text-input text:description="JOOScript">$nome</text:text-input></text:p>

I would like to extract the fields as ${name} and $nome.

I know Apache Tika could be used for that, but I haven't found an example that actually shows field extraction. I believe this is because the fields I am using are unstructured text rather than input field tags.

Thanks in advance, Daniel


Solution

  • In case anyone is interested: we ended up using Apache Tika to obtain the text content of the ODT, and then extracted the fields from that text with the following regular expression:

    \$\{[\w\-\.]*\}
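
    As a rough illustration, the sketch below applies that expression to text that has already been extracted from the document. It uses only the standard library; the class name FieldExtractor and the second pattern are assumptions added here, not part of the original answer. In a real setup the input string would presumably come from Tika, for example via the Tika facade's parseToString method. Note that the expression above only matches the ${name} form; the bare $nome form from the question needs the extended pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
class FieldExtractor {
    // The regular expression from the answer: matches ${...} placeholders only.
    static final Pattern BRACED = Pattern.compile("\\$\\{[\\w\\-.]*\\}");

    // Assumption (not in the original answer): also accept bare $name placeholders.
    static final Pattern BRACED_OR_BARE = Pattern.compile("\\$\\{[\\w\\-.]*\\}|\\$\\w+");

    // Returns every match of the pattern, in document order (duplicates kept).
    static List<String> extractFields(String text, Pattern pattern) {
        List<String> fields = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            fields.add(matcher.group());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Stand-in for the plain text extracted from the ODT document.
        String text = "Hi ${name}!\n$nome";
        System.out.println(extractFields(text, BRACED));         // prints [${name}]
        System.out.println(extractFields(text, BRACED_OR_BARE)); // prints [${name}, $nome]
    }
}
```

    Keeping the pattern as a parameter makes it easy to switch between the original expression and a looser one if the template syntax changes.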