I have a website running on Apache with PHP and MySQL.
I wish to implement a custom search engine on text that is stored in the MySQL table and on .pdf and .docx documents.
I am not sure which API to go for.
I have looked at Google's Custom Search Engine (CSE) and Elasticsearch. Elasticsearch, I have learnt, can only be run on a Java-based server, so I am unable to go down that route.
I know Elasticsearch can handle my requirements through its REST API. Is Google CSE able to do the same, i.e. search through text stored in database tables and PDFs? Are there any other custom search APIs that would work?
Solutions such as Google Custom Search Engine (in your case Google Site Search), or any other web crawler (such as Nutch), only read the web side of things: whatever a browser can reach without being logged in. They index that content by URL and display results as web pages, each with a title and an extract of the text content.
If all PDFs, .docx files and web pages are accessible without login, this works extremely well. The web-app creator should enable that. It does not mean that normal users get access to everything, only the robot (e.g. the publisher Springer lets the Google bot read just about all of its content, but not a normal browser).
If you want a search server to search the fields of your database, it needs to talk to your database. Google Site Search (a form of Google Custom Search) does not allow for that; Elasticsearch and Apache Solr do. However, most web hosting services do not expose the database ports to the outside, for security reasons. This adds another requirement: you will probably have to run the search server on premises, where it can reach the database.
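To make "talking to your database" more concrete, here is a minimal sketch of one way it could work: a small script running next to the database reads rows and turns them into the body of an Elasticsearch `_bulk` request. The function name `build_bulk_payload`, the index name `site_content` and the row layout (a dict with an `id` key) are assumptions for illustration, not something from your setup, and the same idea would carry over to PHP.

```python
import json


def build_bulk_payload(rows, index_name):
    """Turn database rows (dicts with an 'id' key) into an Elasticsearch _bulk body.

    Hypothetical helper: the row layout and index name are assumptions.
    """
    lines = []
    for row in rows:
        # Action line: index this document under the row's primary key.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(row["id"])}}))
        # Source line: the searchable fields, without the id itself.
        lines.append(json.dumps({k: v for k, v in row.items() if k != "id"}))
    # The bulk API expects newline-delimited JSON, each line ending with "\n".
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    sample_rows = [{"id": 1, "title": "Annual report", "body": "Text taken from a table row"}]
    print(build_bulk_payload(sample_rows, "site_content"), end="")
```

The resulting string would be sent with `POST /_bulk` and the header `Content-Type: application/x-ndjson` to wherever the search server listens (by default port 9200 on the same machine). Text from .pdf and .docx files would have to be extracted first and passed in as just another field.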
Having to choose between running Java and using Google CSE seems unavoidable. I know of no solutions of the same quality in other languages (e.g. Drupal can offer a MySQL-based text search, but its matching is far less tolerant). Nowadays, many cloud nodes can run Java.
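For comparison, the database-only fallback can be pictured with MySQL's built-in natural-language full-text search. The sketch below only builds the query; it is not how Drupal implements its search, and it assumes a hypothetical table `documents` with an `id` column and a FULLTEXT index covering `title` and `body`.

```python
def fulltext_search_sql(table, columns, limit=10):
    """Build a MySQL natural-language full-text query with %s placeholders.

    Hypothetical helper: table and column names are illustrative assumptions.
    """
    # Identifiers cannot be bound as parameters, so accept only simple names.
    for name in [table, *columns]:
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    match = f"MATCH({', '.join(columns)}) AGAINST (%s IN NATURAL LANGUAGE MODE)"
    return (
        f"SELECT id, {match} AS score FROM {table} "
        f"WHERE {match} ORDER BY score DESC LIMIT {int(limit)}"
    )


def fulltext_search_params(term):
    # The search term appears twice in the SQL, so it is bound twice.
    return (term, term)


if __name__ == "__main__":
    print(fulltext_search_sql("documents", ["title", "body"], limit=5))
```

The SQL and parameters would be handed to `cursor.execute(sql, params)` of a driver that uses `%s` placeholders, such as PyMySQL. This only covers text already in the table, so .pdf and .docx content would still need to be extracted into a column first.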