How to get more insight into why documents fail to be ingested in Watson Discovery Service

I'm using the DiscoveryV1 module of the watson_developer_cloud Python library to ingest 700+ documents into a WDS collection. Each time I attempt a bulk ingestion, many of the documents fail to be ingested. The failures are nondeterministic; usually around 100 documents fail.

Each time I call discovery.add_document(env_id, col_id, file_info=file_info), the response contains a WDS document_id. After I've made this call for all documents in my corpus, I use the corresponding document_ids to call discovery.get_document(env_id, col_id, doc_id) and check each document's status. Around 100 of these calls return the status Document failed to be ingested and indexed. There is no pattern among the files that fail: they vary in size and include both MS Word (doc) and PDF files.
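The status check described above can be sketched as a small helper that groups document IDs by their reported status. This is only an illustration: group_by_status is a hypothetical name, get_document stands in for a call such as discovery.get_document(env_id, col_id, doc_id), and the assumption that the result is a dict with a "status" key may not hold for every SDK version.

```python
from collections import defaultdict


def group_by_status(doc_ids, get_document):
    """Group document IDs by the status the service reports for each one.

    get_document is any callable taking a document ID and returning a dict.
    Assumption: the dict has a "status" key; IDs without one are filed
    under "unknown".
    """
    groups = defaultdict(list)
    for doc_id in doc_ids:
        info = get_document(doc_id)
        groups[info.get("status", "unknown")].append(doc_id)
    return dict(groups)
```

It could be called as group_by_status(doc_ids, lambda d: discovery.get_document(env_id, col_id, d)), which gives one list of IDs per status to re-check or re-submit.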

My code to ingest a document is based on the WDS documentation and looks something like this:

# Open in binary mode: .doc and .pdf files are not text
with open(f_path, 'rb') as file_data:
    if f_path.endswith('.doc') or f_path.endswith('.docx'):
        response = discovery.add_document(env_id, col_id, file_info=file_data, mime_type='application/msword')
    else:
        response = discovery.add_document(env_id, col_id, file_info=file_data)

Because my corpus is relatively large (~3 GB), I receive Service is busy processing... responses from discovery.add_document(env_id, col_id, file_info=file_info) calls, in which case I call sleep(5) and try again.
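A minimal sketch of that sleep-and-retry logic, with a capped number of attempts and a growing delay, might look like the following. The helper name add_with_retry is hypothetical, and it assumes the busy response surfaces as an exception whose message contains the word "busy"; if your SDK version reports it differently, the check would need to change.

```python
import time


def add_with_retry(add_fn, max_attempts=5, base_delay=5.0, sleep=time.sleep):
    """Call add_fn, retrying while the service reports that it is busy.

    Assumption: a busy response is raised as an exception whose text
    contains "busy". Any other error, or a busy error on the final
    attempt, is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return add_fn()
        except Exception as exc:
            if "busy" not in str(exc).lower() or attempt == max_attempts:
                raise
            # Wait a little longer after each failed attempt.
            sleep(base_delay * attempt)
```

For example, add_with_retry(lambda: discovery.add_document(env_id, col_id, file_info=file_data)). If the same open file object is reused across attempts, it is probably worth calling file_data.seek(0) inside the callable so a retry does not upload from wherever the previous read stopped.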

I've exhausted the WDS documentation without any luck. How can I get more insight into the reason that these files are failing to be ingested?

Solution

You should be able to use the queryNotices API (https://watson-api-explorer.mybluemix.net/apis/discovery-v1#!/Queries/queryNotices) to see errors and warnings that occur during ingestion, along with details that might explain why the ingestion failed.

Unfortunately, at the time of this posting, the Python SDK does not appear to have a method that wraps this API yet, so you can use the Watson Discovery Tooling or query the API directly with curl (replacing the values in {} with your collection-specific values):

curl -u "{username}:{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/notices?version=2017-01-01"
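If you would rather stay in Python, the same request can be made with the standard library. This is a rough sketch rather than an SDK feature: notices_url and fetch_notices are hypothetical helper names, the endpoint and version are copied from the curl command, and the parsed JSON is returned as-is because the exact response layout is not assumed here.

```python
import base64
import json
import urllib.parse
import urllib.request

BASE_URL = "https://gateway.watsonplatform.net/discovery/api/v1"


def notices_url(environment_id, collection_id, version="2017-01-01"):
    """Build the notices URL used in the curl command."""
    path = "/environments/{}/collections/{}/notices".format(
        urllib.parse.quote(environment_id, safe=""),
        urllib.parse.quote(collection_id, safe=""),
    )
    return BASE_URL + path + "?" + urllib.parse.urlencode({"version": version})


def fetch_notices(username, password, environment_id, collection_id):
    """GET the notices with HTTP basic auth and return the parsed JSON."""
    request = urllib.request.Request(notices_url(environment_id, collection_id))
    credentials = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_notices(username, password, env_id, col_id) should return whatever the notices endpoint reports for the collection, which you can then inspect for the failed document_ids.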