
Solr DIH regexTransformer seems to only know about one capturing parentheses group


I am importing data using the DIH and need to parse a string, capture two numbers, and populate a field of type=location (which accepts a "lat,long" coordinate pair). The logical thing to do is:

  <field column="latLong" 
         regex="Latitude is ([-\d.]+)\s+ Longitude is ([-\d.]+)\s+" 
         replaceWith="$1,$2" />

It seems the DIH only knows about a single capture group. So $2 is never used.

Has anyone ever used more than one capture with the regexTransformer? Searching the documentation didn't provide any examples of $2 or $3. What gives, O ye priests of Solr?


Solution

  • It is not true that Solr DIH does not understand $2, $3, etc.

    I just tried this by adding the following to my DIH data-config.xml:

    <entity name="foo" 
            transformer="RegexTransformer" 
            query="SELECT list_id FROM lists WHERE list_id = ${Lists.id}">
        <field column="firstLastNum" 
               regex="^(\d).*?(\d)$" 
               replaceWith="$1:$2" 
               sourceColName="list_id"/>
    </entity>
    

    and then added the field to my schema.xml:

    <field name="firstLastNum" type="string" indexed="true" stored="true"/>
    

    When I indexed a document with list_id = 390, firstLastNum was 3:0, which is correct: $1 captured the first digit and $2 the last.

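    I suspect the problem is the regex rather than the capture groups. In your pattern, the \s+ followed by a literal space requires at least two whitespace characters before "Longitude", and the trailing \s+ requires whitespace after the longitude value, so the pattern may not match your input at all. Maybe try this regex: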

    regex="Latitude is ([-\d.]+)\s*Longitude is ([-\d.]+)"
    

    Another possible reason is that latLong is of type location while the "$1,$2" replacement produces a string, but I am not sure about that.
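
    For the lat/long case, a complete entity might look like the sketch below. The entity name, the places table, and the description source column are hypothetical (the question does not show them). It also assumes the raw text lives in a different column than latLong, which is why sourceColName is set.

    ```xml
    <!-- Sketch only: entity name, table, and the "description" column are hypothetical. -->
    <entity name="place"
            transformer="RegexTransformer"
            query="SELECT id, description FROM places">
        <!-- Reads the raw text from "description" and writes "lat,long" into latLong.
             \s* allows any amount of whitespace (including none) before "Longitude",
             and nothing is required after the longitude value. -->
        <field column="latLong"
               sourceColName="description"
               regex="Latitude is ([-\d.]+)\s*Longitude is ([-\d.]+)"
               replaceWith="$1,$2"/>
    </entity>
    ```

    As far as I know, replaceWith substitutes only the matched portion of the value, so any text before or after the match is kept. If the source string holds more than the two coordinates, the pattern would probably need to cover the surrounding text as well.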