Tags: php, linux, file-search

Fast text search in over 600,000 files


I have a PHP application on a Linux server. It has a folder called notes_docs which contains over 600,000 txt files. The folder structure of notes_docs is as follows -

 - notes_docs
   - files_txt
     - 20170831
           - 1_837837472_abc_file.txt
           - 1_579374743_abc2_file.txt
           - 1_291838733_uridjdh.txt
           - 1_482737439_a8weele.txt
           - 1_733839474_dejsde.txt
     - 20170830
     - 20170829

I have to provide a fast text search utility which can show results in the browser. So if my user searches for "new york", all the files which contain "new york" should be returned in an array. If the user searches for "foo", all files containing "foo" should be returned.

I already tried code using scandir and DirectoryIterator, but it is too slow: a search takes more than a minute and still does not finish. I also tried Ubuntu's find, which was again slow, taking over a minute to complete, because there are too many folder iterations and notes_docs is currently over 20 GB.
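For reference, a minimal sketch of the kind of per-request scan described above (a RecursiveDirectoryIterator walk that greps every file). The root path and search term are placeholder assumptions; the point is that every search re-reads the whole 20 GB tree from disk.

```php
<?php
// Hypothetical version of the slow per-request scan: walk the tree and
// substring-search every txt file. Path and needle are illustrative.
function scan_files(string $root, string $needle): array
{
    $matches = [];
    $iter = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iter as $file) {
        if ($file->isFile() && $file->getExtension() === 'txt') {
            // Case-insensitive substring match on the whole file contents.
            if (stripos(file_get_contents($file->getPathname()), $needle) !== false) {
                $matches[] = $file->getPathname();
            }
        }
    }
    return $matches;
}

// $results = scan_files('/var/www/notes_docs/files_txt', 'new york');
```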

Any solution that makes this faster is welcome. I can make design changes, or have my PHP code call out via curl to code written in another language. I can make infrastructure changes too in extreme cases (such as keeping something in memory).

I want to know how people in the industry do this. Indeed and ZipRecruiter both provide file search.

Please note I have 2 GB to 4 GB of RAM, so keeping all the files loaded in RAM at all times is not acceptable.

EDIT - All the inputs below are great. For those who come later: we ended up using Lucene for indexing and text search, and it performed really well.


Solution

  • To keep it simple: there is no fast way to open, search, and close 600k documents every time you want to do a search. Your "over a minute" benchmarks are probably with a single test account. If you plan to expose this search on a multi-user website, you can quickly forget about it, because your disk I/O will be off the charts and block your entire server.

    So your only option is to index all the files, just as every other quick search utility does. Whether you use Solr or Elasticsearch as mentioned in the comments, or build something of your own, the files will be indexed.
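As a rough illustration of the indexing route, here is a hedged sketch of driving Elasticsearch's REST API with plain curl from PHP. The host, the index name "notes", and the field names are assumptions for illustration; indexing happens once per file, and each user search then hits the index instead of the filesystem.

```php
<?php
// Minimal helper for talking to a local Elasticsearch node over HTTP.
// Host, index name and field names are illustrative assumptions.
function es_request(string $method, string $path, ?array $body = null): array
{
    $ch = curl_init('http://localhost:9200' . $path);
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, $method);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if ($body !== null) {
        curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($body));
        curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
    }
    $response = curl_exec($ch);
    curl_close($ch);
    return json_decode($response, true);
}

// Index one document (run once per file, e.g. from a batch import script).
es_request('PUT', '/notes/_doc/1_837837472', [
    'path'    => 'files_txt/20170831/1_837837472_abc_file.txt',
    'content' => file_get_contents('/var/www/notes_docs/files_txt/20170831/1_837837472_abc_file.txt'),
]);

// Per search: query the index instead of scanning the filesystem.
$hits = es_request('POST', '/notes/_search', [
    'query' => ['match_phrase' => ['content' => 'new york']],
]);
```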

    Considering the txt files are text versions of PDF files you receive, I'm betting the easiest solution is to write the text to a database instead of a file. It won't take up much more disk space anyway.
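A one-off import along those lines might look like the sketch below: walk notes_docs/files_txt and store each file's text in a MySQL table. The table name, column names, and credentials are assumptions for illustration.

```php
<?php
// One-off import sketch: move the file contents into a MySQL table.
// Credentials, table and column names are illustrative assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=notes;charset=utf8mb4', 'user', 'pass');
$pdo->exec('CREATE TABLE IF NOT EXISTS notes_docs (
    id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    path VARCHAR(255) NOT NULL,
    body MEDIUMTEXT NOT NULL
) ENGINE=InnoDB');

$insert = $pdo->prepare('INSERT INTO notes_docs (path, body) VALUES (?, ?)');
$iter = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator('/var/www/notes_docs/files_txt', FilesystemIterator::SKIP_DOTS)
);
foreach ($iter as $file) {
    if ($file->isFile() && $file->getExtension() === 'txt') {
        $insert->execute([$file->getPathname(), file_get_contents($file->getPathname())]);
    }
}
```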

    Then you can enable full-text search on your database (MySQL, MSSQL and others support it) and I'm sure the response times will be a lot better. Keep in mind that creating these indexes does require storage space, but the same goes for the other solutions.
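For the MySQL variant, a minimal sketch assuming the notes_docs table from the import above: add a FULLTEXT index once, then serve each search with MATCH ... AGAINST instead of touching the filesystem.

```php
<?php
// Full-text search sketch against the assumed notes_docs table.
$pdo = new PDO('mysql:host=localhost;dbname=notes;charset=utf8mb4', 'user', 'pass');

// One-time: build the full-text index (InnoDB supports FULLTEXT since MySQL 5.6).
$pdo->exec('ALTER TABLE notes_docs ADD FULLTEXT INDEX ft_body (body)');

// Per search: return the paths of all documents containing the phrase.
$stmt = $pdo->prepare(
    'SELECT path FROM notes_docs
     WHERE MATCH(body) AGAINST (? IN BOOLEAN MODE)'
);
$stmt->execute(['"new york"']); // quoted to require the exact phrase in boolean mode
$paths = $stmt->fetchAll(PDO::FETCH_COLUMN);
```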

    Now if you really want to speed things up, you could try to parse the resumes at a more detailed level. Try to extract locations, education, spoken languages and other information you regularly search for, and put them in separate tables/columns (a schema sketch follows the examples below). This is a very difficult task and almost a project on its own, but if you want valuable search results, this is the way to go, because searching text without context gives very different results. Just think of your example, "new york":

    1. I live in New York
    2. I studied at New York University
    3. I love the song "new york" by Alicia Keys (in a personal bio)
    4. I worked for New York Pizza
    5. I was born in new yorkshire, UK
    6. I spent a summer breeding new yorkshire terriers.
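As referenced above, here is a schema sketch for that structured-parsing idea: keep the fields you actually filter on (location, education, languages) in their own indexed columns next to the raw text. All names and the example query are illustrative assumptions.

```php
<?php
// Hypothetical structured-fields table alongside the assumed notes_docs table.
$pdo = new PDO('mysql:host=localhost;dbname=notes;charset=utf8mb4', 'user', 'pass');
$pdo->exec('CREATE TABLE IF NOT EXISTS resume_fields (
    note_id   INT UNSIGNED NOT NULL,
    location  VARCHAR(100) NULL,
    education VARCHAR(255) NULL,
    languages VARCHAR(255) NULL,
    INDEX idx_location (location),
    FOREIGN KEY (note_id) REFERENCES notes_docs(id)
) ENGINE=InnoDB');

// A location search now hits an indexed column instead of free text,
// so "New York" as a place no longer matches songs or Yorkshire terriers.
$stmt = $pdo->prepare(
    'SELECT d.path FROM notes_docs d
     JOIN resume_fields f ON f.note_id = d.id
     WHERE f.location LIKE ?'
);
$stmt->execute(['New York%']);
$paths = $stmt->fetchAll(PDO::FETCH_COLUMN);
```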