Extract information from RTF file in shell script


We have many RTF files that we need to upload to Oracle EBS under their respective categories. To do so, we need to read some information stored in the Document Properties of each RTF file. The fields are Title, Subject, Author, Company and Category.

When we open an RTF file in Notepad, we can see this info, but we are not sure how to extract it with a Linux command. Using grep wasn't very successful.

Here is the part of the RTF file that holds this info:

\mwrapIndent1440\mintLim0\mnaryLim1}{\info{\title ^XXSLS_GBL_ORDACK^}{\subject XXSLS}{\author ^es_ES,es_FR,ES_IT,ES_de^}{\doccomm $Header: XXSLS_GBL_ORDACK_ES_ES.rtf $}
{\operator }{\creatim\yr2012\mo11\dy11\hr14\min3}{\revtim\yr2013\mo3\dy2\hr10\min43}{\version24}{\edmins361}{\nofpages4}{\nofwords725}{\nofchars14202}{\*\manager }{\*\company }{\*\category ^BD^}{\nofcharsws14898}
{\vern32773}}{\*\userprops {\propname _DocHome}\proptype3{\staticval -974575144}}{\*\xmlnstbl {\xmlns1 http://schemas.microsoft.com/office/word/2003/wordml}}\paperw11850\paperh18144\margl851\margr851\margt851\margb0\gutter0\ltrsect

Can someone please suggest how we can extract this info as follows:

Title=^XXSLS_GBL_ORDACK^
Subject=XXSLS
Author=^es_ES,es_FR,ES_IT,ES_de^
Category=^BD^

Solution

  • grep can do this with the -E (extended regular expressions) and -o (print only the matching part) flags.

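     # Match each RTF control word together with its leading backslash, so that
     # ordinary text such as "subject " elsewhere in the file is not picked up,
     # then strip the control word from the match.
     title=$(grep -oE '\\title [^}]+' file.rtf | sed 's/^\\title //')
     echo "title=$title"
     subject=$(grep -oE '\\subject [^}]+' file.rtf | sed 's/^\\subject //')
     echo "subject=$subject"
     author=$(grep -oE '\\author [^}]+' file.rtf | sed 's/^\\author //')
     echo "author=$author"
     category=$(grep -oE '\\category [^}]+' file.rtf | sed 's/^\\category //')
     echo "category=$category"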
    

    I get

    title=^XXSLS_GBL_ORDACK^
    subject=XXSLS
    author=^es_ES,es_FR,ES_IT,ES_de^
    category=^BD^
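
Since the question asks for capitalized labels (Title=, Subject=, ...) and there are many files to process, the same grep/sed idea can be wrapped in a small function. This is only a minimal sketch: the function name rtf_props is hypothetical, it assumes a grep that supports -o (GNU or BSD), it assumes each property sits on a single line, and it keeps only the first match per field.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original answer): print the Title,
# Subject, Author and Category properties of one RTF file as Label=value lines.
# Assumes grep supports -o and that each property is on a single line.
rtf_props() {
    for label in Title Subject Author Category; do
        # RTF control words are lowercase: Title -> title
        key=$(printf '%s' "$label" | tr '[:upper:]' '[:lower:]')
        # Match "\key value" up to the closing brace, keep the first hit,
        # then strip the leading "\key " so only the value remains.
        value=$(grep -oE "\\\\$key [^}]+" "$1" | head -n 1 | sed "s/^\\\\$key //")
        printf '%s=%s\n' "$label" "$value"
    done
}
```

Calling rtf_props file.rtf prints one line per field; a field that is empty or missing in the file comes out with an empty value. To cover a whole directory, the function can be called in a loop such as: for f in *.rtf; do rtf_props "$f"; done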