Indexing PDF with Solr

Can anyone point me to a tutorial.

My main experience with Solr is indexing CSV files. But I cannot find any simple instructions/tutorial to tell me what I need to do to index pdfs.

But it makes very little sense to me. Do I need to install Tika?

Im lost - please help

Solution

The hardest part of this is getting the metadata from the PDFs, using a tool like Aperture simplifies this. There must be tonnes of these tools

Aperture is a Java framework for extracting and querying full-text content and metadata from PDF files

Apeture grabbed the metadata from the PDFs and stored it in xml files.

I parsed the xml files using lxml and posted them to solr