Finding possibility of occurrence of words in a document using bloom filter

I have a list of words like - list1 = [boy, apple, mango, car] and I have two documents with following content:

document1= The boy driving a car ate apple and mango.
document2= The boy ate an apple.

I just need to find out if the given list of words exist in the document.

In order to check if the words in list1 exist in document I can create a bloom filter for list1 (say bloomlist1) and a bloom filter for document1 (say bloomdocument1). Then I can perform bitwise anding and check if the result is same as bloomlist1. If its same, I can say that all words in list1 exist in document1. So, it will return True.

If I do the same approach for document2 i.e. do bitwise anding, then the result will be False. But I need to get True as a result even if a single word on the list is contained in the document.

Is this possible with bloom filter or do I need any other data structure. If no, what can be the best data structure which fulfills both the criteria.

Solution

I don't think your use of Bloom filters is appropriate.

You state that two filters need to be constructed, one for the word list and another one for the document being searched. Then you bitwise AND the filters. If the result is the same as the original word list filter the document is accepted, otherwise you reject it.

If this is a correct understanding of your process it is quite clear that a document can only pass if it contains all of the words from the list (or by means of hash collisions a few extra bits were set in the document Bloom filter which might cause its acceptance - possibly resulting in a false positive).

If you want to select a document as soon as one word from the list is matched, you only need one Bloom filter for the word list (none for the document being tested). Hash each word from the document, one by one, using the Bloom filter hash functions. Accept the document as soon as the resulting hash matches on all corresponding bits in the word list Bloom filter (this is a hit). You then need to verify that the hit was not due to a false positive.

The "beauty" of a Bloom filter is that it does not suffer from false negatives. That is, if the none of the words from your document have a "hit" in the Bloom filter you can be 100 percent sure that none of the words in the document occur in the associated word list.

One problem you are facing is that Bloom filters are prone to false positives. A false positive is when there is a hit on the Bloom filter for a word not in the associated list. Because of this, you must explicitly verify the "truth" of every hit using the actual word list whenever the Bloom filter indicates a "hit". There is no way around it.

The "art" in constructing a good Bloom filter is to find a set of hashing functions that are cheap to execute and lead a low false positive hit rate (generally this equates to independence and good distribution among the hash functions). A quick google search on Bloom filters will give you plenty of guidance with respect to how many hash functions and how large the filter needs to be to achieve a given false positive rate (assuming good hashing functions).

If you do the calculations you will find that for a word list of any significant size you will need to use at least 7 hash functions to achieve an acceptable false-positive rate. Running 7 hash functions against every word in a large document is going to be "expensive" no matter how you do it. Furthermore, if your documents contain a large number of different words the probablilty of suffering a false positive hit on the Bloom filter may become significant - decreasing its usefulness.

Other answers here suggest that a better approach would be to construct a simple hash table for words in the list then hash each word from the document to see if it has a hit in the list. If so, select the document. Simple. This technique will most likely out perform the Bloom filter approach for this type of application unless there are some very special circumstances you have not outlined in your question.

EDIT - What is the best approach?

There are many ways of doing multi string searches on a given text. It is difficult to say which one is going to be the best solution for your application because most algorithms and their implementations are sensitive to complicating factors (eg. average word size, number of distinct words, document size, number of words in the search list, available memory, probability of having a "hit", etc.). There is no one correct answer here.

Using a Bloom filter may very well be a reasonable approach, however, you should at least look at other alternatives. A few examples might be:

Suffix Trees which can be very efficient for moderately sized word lists.
Compact encoding schemes for multi-string matching
Hybrid algorithms combining Bloom filters plus a Suffix Tree can also be very efficient
And, if you can make some simplifying assumptions, such as patterns to match have the same length, something like this could be useful

The bottom line is you should look at a wider range of strategies before settling on any specific solution.