DateFormatTransformer not working with FileListEntityProcessor in Data Import Handler


While indexing files from a local folder on my system, I am using the configuration given below. However, the lastmodified attribute is being indexed in the format "Wed 23 May 09:48:08 UTC", which is not the standard format Solr uses for filter queries. So I am trying to format the lastmodified attribute in data-config.xml as shown below.

<dataConfig>
    <dataSource name="bin" type="BinFileDataSource" />
    <document>
        <entity name="f" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="D://FileBank" 
            fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)" onError="skip"
            recursive="true" transformer="DateFormatTransformer">

            <field column="fileAbsolutePath" name="path" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastmodified" dateTimeFormat="YYYY-MM-DDTHH:MM:SS.000Z" locale="en"/>
            <entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor"
            url="${f.fileAbsolutePath}" format="text" onError="skip">
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <!--<field column="text" />-->          
            </entity>
        </entity>
    </document>
</dataConfig>

But the transformer has no effect, and the same format is indexed again. Has anyone had success with this? Is the above configuration right, or am I missing something?


Solution

  • The dateTimeFormat you provided does not look correct. Pattern letters are case-sensitive: in Java's SimpleDateFormat, MM is the month while mm is minutes, dd is the day of the month while DD is the day of the year, and ss is seconds while SS is milliseconds. A literal such as T also has to be quoted ('T'). In addition, the pattern describes the text being parsed, and the format you showed does not match the date text you are trying to parse, so the value is probably left as it is, unparsed.

    Also, if you have several different date formats, you could parse them after DIH runs by creating a custom UpdateRequestProcessor chain. The schemaless example shows several date formats being handled as part of its auto-mapping, and you could do the same thing explicitly for a specific field.
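
As a rough sketch of the first point, the field mapping could use a pattern that mirrors the incoming text rather than the desired output. This assumes the raw value reaches the transformer as text in Java's default Date layout (for example "Wed May 23 09:48:08 UTC 2012"); that layout is an assumption, not something confirmed in the question, so adjust the pattern to whatever the raw value really looks like.

```xml
<!-- Sketch only: pattern letters follow java.text.SimpleDateFormat and are case-sensitive. -->
<!-- Assumes the raw value looks like "Wed May 23 09:48:08 UTC 2012". -->
<field column="fileLastModified" name="lastmodified"
       dateTimeFormat="EEE MMM dd HH:mm:ss zzz yyyy" locale="en"/>
```

For the second point, a chain along these lines could be declared in solrconfig.xml. The chain name parse-lastmodified and the two formats listed are illustrative choices; the processor class is the date parser used by the schemaless example. Pattern syntax for this processor depends on the Solr version, so check the formats against the release you run.

```xml
<!-- Hypothetical chain name; the formats below are examples, not a complete list. -->
<updateRequestProcessorChain name="parse-lastmodified">
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <!-- Restrict date parsing to the one field instead of every field. -->
    <str name="fieldName">lastmodified</str>
    <arr name="format">
      <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
      <str>yyyy-MM-dd HH:mm:ss</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain only runs if it is selected for the update, for instance by setting the update.chain parameter to parse-lastmodified in the defaults of the DIH request handler.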