Search code examples
javadata-structuresb-treebinary-search-tree

Optimal on-disk data structure for searching a file?


I've spent a couple of hours reading posts that were related to the question in a bid to try and come up with a solution, but I wasn't really successful in coming up with one.

So here goes: I was once asked in an interview which data structure I would use to search a if a particular word existed in a file. The file is also supposedly big enough to not be able to fit in the memory and the interviewer was really looking for an on-disk solution.

Is the B-Tree an on-disk data structure?

A Binary Search Tree is an in-memory data structure isn't it?


Solution

  • There are really two different possible questions here:

    1. Given a massive file, and a word, how do you check if the word exists in the file?

    2. Given a massive file, how do you build an index so that you can efficiently check if an arbitrary word exists in the file?

    The first problem is efficiently solved with Boyer-Moore and a linear search through the file. If you're only searching once, building an index is a complete waste of time.

    Regarding the second problem, it sounds like the interviewer is really pushing B-Trees.