java servlets ibm-watson retrieve-and-rank

Upload documents into Watson's Retrieve & Rank service


I'm implementing a solution using Watson's Retrieve & Rank service.

When I use the tooling interface, I upload my documents and they appear as a list. I can click on any of them to see all the titles inside the document (the answer units), as shown in Picture 1 and Picture 2.

When I upload documents via Java, they aren't recognized as whole documents. They get uploaded in parts, with each answer unit stored as a separate document.

How can I upload each document as an entire document, and not only as parts of it?

Here's the code for the upload function in Java:

    public Answers ConvertToUnits(File doc, String collection) throws ParseException, SolrServerException, IOException {
        DC.setUsernameAndPassword(USERNAME, PASSWORD);
        // Convert the file into answer units with the Document Conversion service
        Answers response = DC.convertDocumentToAnswer(doc).execute();
        SolrInputDocument newdoc = new SolrInputDocument();
        WatsonProcessing wp = new WatsonProcessing();

        // Each answer unit is indexed as its own Solr document
        for (int i = 0; i < response.getAnswerUnits().size(); i++) {
            String titulo = response.getAnswerUnits().get(i).getTitle();
            newdoc.addField("title", titulo);
            for (int j = 0; j < response.getAnswerUnits().get(i).getContent().size(); j++) {
                String texto = response.getAnswerUnits().get(i).getContent().get(j).getText();
                newdoc.addField("body", texto);
            }
            wp.IndexDocument(newdoc, collection);
            // Reuse the same object for the next answer unit
            newdoc.clear();
        }
        wp.ComitChanges(collection);
        return response;
    }


    public void IndexDocument(SolrInputDocument newdoc, String collection) throws SolrServerException, IOException {
        // Add the document to the given collection; the commit happens separately
        solrClient.add(collection, newdoc);
    }

Solution

  • You can specify config options in this line:

    Answers response = DC.convertDocumentToAnswer(doc).execute();
    

    I think something like this should do the trick:

    String configAsString = "{ \"conversion_target\":\"answer_units\", \"answer_units\": { \"selector_tags\": [] } }";
    
    JsonParser jsonParser = new JsonParser();
    JsonObject customConfig = jsonParser.parse(configAsString).getAsJsonObject();    
    
    Answers response = DC.convertDocumentToAnswer(doc, null, customConfig).execute();
    

    I haven't tried it out, so I might not have the syntax exactly right, but hopefully this gets you on the right track.

    Essentially, this uses the selector_tags option in the config (see https://www.ibm.com/watson/developercloud/doc/document-conversion/customizing.shtml#htmlau for the documentation) to specify which tags the document should be split on. Specifying an empty list means there are no tags to split on, so the document isn't split at all and comes out as a single answer unit, which is what you want.

    (Note that you can do this through the tooling interface too, by unticking the "Split my documents up into individual answers for me" option when you upload the document.)
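
If you later want to experiment with different split points, a small helper can build the same config string for any list of tags. This is only a sketch in plain Java: `SelectorTagsConfig` and `build` are hypothetical names, not part of the Watson SDK, and the helper assumes the tags are simple names such as `h1` that need no JSON escaping. An empty list reproduces the config string from the answer above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the Watson SDK.
class SelectorTagsConfig {

    // Builds the custom config JSON as a string.
    // Assumes tag names are simple (e.g. "h1") and need no JSON escaping.
    public static String build(List<String> selectorTags) {
        String tags = selectorTags.stream()
                .map(tag -> "\"" + tag + "\"")
                .collect(Collectors.joining(", "));
        return "{ \"conversion_target\": \"answer_units\", "
                + "\"answer_units\": { \"selector_tags\": [" + tags + "] } }";
    }

    public static void main(String[] args) {
        // Empty list: no tags to split on, so one answer unit per document
        System.out.println(build(Collections.<String>emptyList()));
        // Example with tags: split on h1 and h2 headings
        System.out.println(build(Arrays.asList("h1", "h2")));
    }
}
```

The resulting string can be parsed with `JsonParser` and passed to `convertDocumentToAnswer` in the same way as `configAsString` in the snippet above.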