
Indexing PDF with Solr


Can anyone point me to a tutorial?

My main experience with Solr is indexing CSV files, but I cannot find any simple instructions or tutorial that tells me what I need to do to index PDFs.

I have seen this: http://wiki.apache.org/solr/ExtractingRequestHandler

But it makes very little sense to me. Do I need to install Tika?

I'm lost, please help.


Solution

  • The hardest part of this is getting the metadata out of the PDFs; using a tool like Aperture simplifies this. There are plenty of similar tools.

    Aperture is a Java framework for extracting and querying full-text content and metadata from PDF files.

    Aperture extracted the metadata from the PDFs and stored it in XML files.

    I parsed the XML files using lxml and posted them to Solr.
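
The last step (parse the XML, post it to Solr) might look roughly like the sketch below. It is only an illustration under assumptions: the XML layout (`<document>` with `title`, `author` and `fulltext` children), the field names, and the core URL are all hypothetical, so adapt them to whatever your extraction tool actually writes and to your own schema. It sends documents to Solr's JSON update handler, and falls back to the standard library parser if lxml is not installed.

```python
import json
import urllib.request

try:
    from lxml import etree  # the parser mentioned in the answer
except ImportError:
    # The stdlib parser supports the same calls used below.
    import xml.etree.ElementTree as etree


def xml_to_solr_doc(xml_bytes, doc_id):
    """Turn one metadata XML file into a flat dict for Solr.

    Assumes a hypothetical layout:
    <document><title/><author/><fulltext/></document>
    """
    root = etree.fromstring(xml_bytes)
    doc = {"id": doc_id}
    for field in ("title", "author", "fulltext"):  # hypothetical field names
        text = root.findtext(field)
        if text and text.strip():
            doc[field] = text.strip()
    return doc


def build_update_request(docs, solr_url):
    """Build a POST request for Solr's JSON update handler."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        solr_url + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_to_solr(docs, solr_url):
    """Send the documents to Solr and return the HTTP status code."""
    with urllib.request.urlopen(build_update_request(docs, solr_url)) as resp:
        return resp.status
```

Usage would be something like `post_to_solr([xml_to_solr_doc(open("paper.xml", "rb").read(), "paper.pdf")], "http://localhost:8983/solr/mycore")`, where the file name and the core name `mycore` are placeholders. The fields you send must exist in your Solr schema (or the schema must allow them dynamically).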