c# text information-retrieval

Text extraction library for different file types (PDF, DOC, DOCX, TXT) in C#


I'm building an information retrieval system that searches text across multiple file formats. I tried the EPocalipse IFilter library, but it throws an exception when reading DOCX files. I then tried the Toxy library, but it throws an exception for Arabic DOC files. Finally, I tried the TikaOnDotNet library, but it needs Java to run, and I need to deploy the system on a hosting server that does not have Java installed.


Solution

  • What about using these libraries:

    For DOC/DOCX: http://www.dotnetperls.com/word

    For PDF: https://github.com/itext/itextsharp

    For TXT: https://msdn.microsoft.com/en-us/library/ms143368(v=vs.110).aspx
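
One way to combine per-format libraries like these is to dispatch on the file extension. The sketch below is only an illustration under assumptions: `TextExtractor`, `Register`, and `Extract` are hypothetical names, and only the TXT case is implemented (with `File.ReadAllText`). Extractors for PDF, DOC, and DOCX would be registered the same way, each wrapping whichever library you settle on.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper (not part of any library): picks an extractor by file extension.
public static class TextExtractor
{
    // Maps a file extension (case-insensitive) to a function that returns the file's text.
    private static readonly Dictionary<string, Func<string, string>> Extractors =
        new Dictionary<string, Func<string, string>>(StringComparer.OrdinalIgnoreCase)
        {
            // Plain text: a byte order mark is honored if present;
            // otherwise the file is decoded as UTF-8 (an assumption about your files).
            { ".txt", path => File.ReadAllText(path, Encoding.UTF8) },
        };

    // Adds or replaces the extractor for an extension such as ".pdf" or ".docx".
    public static void Register(string extension, Func<string, string> extractor)
    {
        Extractors[extension] = extractor;
    }

    // Returns the text of the file, or throws if no extractor handles its extension.
    public static string Extract(string path)
    {
        string extension = Path.GetExtension(path);
        Func<string, string> extractor;
        if (!Extractors.TryGetValue(extension, out extractor))
        {
            throw new NotSupportedException(
                "No extractor registered for '" + extension + "'.");
        }
        return extractor(path);
    }
}
```

With this in place, a PDF extractor built on iTextSharp, for example, could be plugged in through `TextExtractor.Register(".pdf", ...)` without changing the search code that calls `TextExtractor.Extract(path)`.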