elasticsearch, mediawiki

MediaWiki search for text in files NOT in upload directory


I have a small Wiki running locally on an Ubuntu 20.04 box.

My goal is to index files (mainly PDF, but also DOCX and PPTX) in a mounted directory and search for text within them. I can already run a full-text search on PDF files that were uploaded into the wiki using the TikaAllTheFiles extension. How can I include the files in an external directory?

My setup:

  • MediaWiki 1.39.3
  • PostgreSQL 12.15
  • Elasticsearch 7.10.2
  • MW-Extension CirrusSearch 6.5.4
  • MW-Extension Elastica 6.2.0
  • MW-Extension TikaAllTheFiles 1.0.1

Solution

  • You will have to 'upload' your files into the wiki so that each file gets its own file page. Here are some tools that can do a batch upload:

    https://mediawiki.org/wiki/Category:Bulk_upload

    I think this is the best option for you:

    https://mediawiki.org/wiki/Manual:ImportImages.php

    You can also change the directory where uploaded files are stored. The location can be external, such as an AWS S3 bucket, or you can configure the wiki to use your mounted storage for uploaded files. See:

    https://www.mediawiki.org/wiki/Manual:$wgForeignFileRepos

    Manual:$wgForeignFileRepos: a more flexible way of configuring shared upload repositories (and the only way, if you want to set up more than one shared upload source).
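
As a rough illustration of the batch-upload route with importImages.php, here is a minimal Python sketch that only assembles the command line for the maintenance script. The wiki root `/var/www/html/mediawiki` and the mount point `/mnt/documents` are assumptions, not values from the question, and the helper name `build_import_command` is hypothetical. It starts in dry-run mode so nothing is imported until the list of files looks right.

```python
import subprocess
from pathlib import PurePosixPath


def build_import_command(wiki_root, source_dir,
                         extensions=("pdf", "docx", "pptx"), dry_run=True):
    """Build the argv list for MediaWiki's importImages.php maintenance script."""
    script = PurePosixPath(wiki_root) / "maintenance" / "importImages.php"
    cmd = [
        "php",
        str(script),
        "--search-recursively",                  # descend into subdirectories
        "--extensions=" + ",".join(extensions),  # only import these file types
        "--skip-dupes",                          # skip files already uploaded under another name
    ]
    if dry_run:
        cmd.append("--dry")                      # dry run: do not import anything
    cmd.append(str(source_dir))
    return cmd


if __name__ == "__main__":
    # Both paths are assumed for illustration; adjust them to your installation.
    cmd = build_import_command("/var/www/html/mediawiki", "/mnt/documents")
    print(" ".join(cmd))
    # To actually run it on the wiki host, uncomment the next line:
    # subprocess.run(cmd, check=True)
```

Once the dry run lists the expected files, call the helper with `dry_run=False` and run the resulting command as a user that can write to the wiki's upload directory.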