How can I obtain the extracted text from Azure AI Search when querying .msg email files?


I'm trying to access the email body text that was matched when searching .msg email files stored in an Azure Storage blob container. I am able to get the From, To, Subject and Date Sent values using these metadata fields:

metadata_content_type
metadata_message_from
metadata_message_from_email
metadata_message_to
metadata_message_to_email
metadata_message_cc
metadata_message_cc_email
metadata_message_bcc
metadata_message_bcc_email
metadata_creation_date
metadata_last_modified
metadata_subject

Documented here: https://learn.microsoft.com/en-us/azure/search/search-blob-metadata-properties

How can I retrieve the body and attachment text that was matched?

Are there additional fields I can add to my index and/or indexer?

I have tried the following index definition:

{
    "name": "email-msg-index",  
    "fields": [
        {"name": "ID", "type": "Edm.String", "key": true, "searchable": false},
        {"name": "metadata_subject", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
        {"name": "metadata_content_type", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
        {"name": "metadata_message_from", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
        {"name": "metadata_message_from_email", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
        {"name": "metadata_message_to", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
        {"name": "metadata_message_to_email", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
        {"name": "metadata_message_cc", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
        {"name": "metadata_message_cc_email", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
        {"name": "metadata_message_bcc", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
        {"name": "metadata_message_bcc_email", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
        {"name": "metadata_creation_date", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
        {"name": "metadata_last_modified", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false}
    ]
}


Solution

  • I was able to obtain the content by adding a content field to the index. The blob indexer writes the text it extracts from each .msg file into that field. The index is:

    {
        "name": "email-msg-index",
        "fields": [
            {"name": "ID", "type": "Edm.String", "key": true, "searchable": false},
            {"name": "metadata_subject", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
            {"name": "metadata_content_type", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
            {"name": "metadata_message_from", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
            {"name": "metadata_message_from_email", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
            {"name": "metadata_message_to", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
            {"name": "metadata_message_to_email", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
            {"name": "metadata_message_cc", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
            {"name": "metadata_message_cc_email", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
            {"name": "metadata_message_bcc", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
            {"name": "metadata_message_bcc_email", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
            {"name": "metadata_creation_date", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
            {"name": "metadata_last_modified", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": true, "facetable": false},
            {"name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false}
        ]
    }

    The indexer used is:

    {
        "@odata.context": "https://<servicename>.search.windows.net/$metadata#indexers/$entity",
        "@odata.etag": "\"0x000000000000000\"",
        "name": "emailindexer",
        "dataSourceName": "email-blob-datasource",
        "targetIndexName": "email-msg-index",
        "parameters": {
            "configuration": {
                "indexedFileNameExtensions": ".msg",
                "dataToExtract": "contentAndMetadata",
                "parsingMode": "default"
            }
        }
    }

    The query is:

    {
        "search": "{{search}}",
        "select": "metadata_subject, metadata_creation_date, metadata_message_from, metadata_message_to, content",
        "searchFields": "metadata_subject",
        "count": true
    }
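
For anyone calling this from code, here is a minimal Python sketch of sending the query above to the Search Documents REST endpoint. The function names, the service name and key placeholders, and the api-version value are my assumptions, not part of the original setup. Note that the query as posted only matches against metadata_subject. The optional search_body flag is an addition of mine: it also searches the content field and requests hit highlights on it, which may be closer to "the text that was matched".

```python
import json
import urllib.request

# Assumption: a stable REST api-version; adjust to whatever your service uses.
API_VERSION = "2023-11-01"


def build_search_request(service, index, api_key, text, search_body=False):
    """Build (but do not send) a POST request for the Search Documents endpoint.

    `service`, `index` and `api_key` are placeholders for your own values.
    """
    url = (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/search?api-version={API_VERSION}"
    )
    # Same body as the query in the answer above.
    body = {
        "search": text,
        "select": "metadata_subject, metadata_creation_date, "
                  "metadata_message_from, metadata_message_to, content",
        "searchFields": "metadata_subject",
        "count": True,
    }
    if search_body:
        # Not in the original answer: also match on the extracted text
        # and ask the service for highlighted snippets from it.
        body["searchFields"] = "metadata_subject,content"
        body["highlight"] = "content"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


def run_search(request):
    """Send the request and return the list of matching documents."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["value"]


def matched_snippets(doc):
    """Return highlighted content snippets for one result, if any were requested."""
    return doc.get("@search.highlights", {}).get("content", [])
```

Usage would look something like `docs = run_search(build_search_request("<servicename>", "email-msg-index", "<query-key>", "invoice", search_body=True))`, after which `doc["content"]` holds the full extracted text and `matched_snippets(doc)` the highlighted fragments.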