How can I parse/index a PST file into Elasticsearch?


I am able to parse a JSON file in Elasticsearch. Is there any way to parse/index Microsoft Outlook PST files into Elasticsearch indexes?

Thank you very much.


Solution

  • You can use the Elasticsearch plugin "Ingest Attachment", which uses Apache Tika to extract content from native file formats (PDF, XLS, PST, etc.):

    https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest-attachment.html

    The "Ingest Attachment" plugin is the successor to the older "Mapper-Attachments" plugin, so you may find more help by searching for the old name:

    https://www.elastic.co/guide/en/elasticsearch/plugins/current/mapper-attachments.html

    Both plugins let you pass the Base64-encoded PST directly to Elasticsearch, which then parses and indexes the content for you behind the scenes.

    If you want something custom, I suggest using one of the many GitHub projects that read PST files, and then sending the data to Elasticsearch in whatever document mapping you want. There are many PST reader projects on GitHub, so pick a popular one for whichever language you're most comfortable with (Java, C#, etc.). Suggested GitHub search terms: libpst, pst reader

    You could also write a custom parser for Apache Tika and use that instead of a PST reader library. Documentation on how to write one can be found here:

    https://tika.apache.org/1.6/parser.html

    Java example to base64 encode a file to string:

    import java.nio.file.Files;
    import java.util.Base64;

    // Read the whole file into memory, then Base64-encode it to a String.
    // (java.util.Base64 is part of the JDK since Java 8.)
    byte[] bytes = Files.readAllBytes(file.toPath());
    String encodedfile = Base64.getEncoder().encodeToString(bytes);
    

    Pass the resulting encodedfile string in the body of a PUT request, as this article shows:

    https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html
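To make the PUT step more concrete, here is a minimal sketch in plain Java 11+ (no Elasticsearch client library). It builds the two request bodies described in the linked article: an ingest pipeline with an "attachment" processor reading from a "data" field, and a document carrying the Base64 string in that field. The host http://localhost:9200, the pipeline name "attachment", the index name "pst_index", the document id "1", and the class and method names are all assumptions for illustration; adjust them to your cluster.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

class PstIngestSketch {

    // Assumed local cluster address; change to match your setup.
    static final String HOST = "http://localhost:9200";

    // JSON body defining a pipeline with a single "attachment" processor
    // that reads Base64 content from the "data" field.
    static String pipelineBody() {
        return "{\"description\":\"Extract attachment information\","
             + "\"processors\":[{\"attachment\":{\"field\":\"data\"}}]}";
    }

    // JSON body for the document. Standard Base64 output contains only
    // A-Z, a-z, 0-9, '+', '/' and '=', so it needs no JSON escaping.
    static String documentBody(byte[] fileBytes) {
        String encoded = Base64.getEncoder().encodeToString(fileBytes);
        return "{\"data\":\"" + encoded + "\"}";
    }

    // Builds (but does not send) a PUT request with a JSON body.
    static HttpRequest put(String url, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: java PstIngestSketch <path-to-pst>");
            return;
        }
        HttpClient client = HttpClient.newHttpClient();

        // 1. Create the pipeline (hypothetical name "attachment").
        HttpRequest createPipeline =
                put(HOST + "/_ingest/pipeline/attachment", pipelineBody());
        System.out.println(client.send(createPipeline,
                HttpResponse.BodyHandlers.ofString()).body());

        // 2. Index the file through that pipeline
        //    (hypothetical index "pst_index", document id "1").
        byte[] bytes = Files.readAllBytes(Path.of(args[0]));
        HttpRequest indexDoc = put(
                HOST + "/pst_index/_doc/1?pipeline=attachment",
                documentBody(bytes));
        System.out.println(client.send(indexDoc,
                HttpResponse.BodyHandlers.ofString()).body());
    }
}
```

Keep in mind that this reads the whole file into memory and sends it in a single request. PST files can be large, and Elasticsearch caps the HTTP request size (the http.max_content_length setting, 100mb by default), so a big mailbox may need the custom PST reader approach instead.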