Search code examples
algorithmhashhashcodestring-hashinginterval-tree

Hashing methodology for collection of strings and integer ranges


I have a data, for example per the following: enter image description here

I need to match the content with the input provided for the Content & Range fields to return the matching rows. As you can see the Content field is a collection of strings & the Range field is a range between two numbers. I am looking at hashing the data, to be used for matching with the hashed input. Was thinking about Iterating through the collection of individual strings hashcode & storing it for the Content field. For the Range field I was looking at using interval trees. But then the challenge is when i hash the Content input & Range input how will i find if it that hashcode is present in the hashcode generated for the collection of strings in the Content fields & the same for the Range fields.

Please do let me know if there are any other alternate ways in which this can be achieved. Thanks.


Solution

  • There is a simple solution to your problem: Inverted Index.

    For each item in content, create the inverted index that maps 'Content' to 'RowID', i.e. create another table of 2 columns viz. Content(string), RowIDs(comma separated strings).

    For your first row, add the entries {Azd, 1}, {Zax, 1}, {Gfd, 1}..., {Mni, 1} in that table. For the second row, add entries for new Content strings. For the Content string already present in the first row ('Gfd', for example), just append the new row id to the entry you created for first row. So, Gfd's row will look like {Gfd, 1,2}.

    When done processing, you will have the table that will have 'Content' strings mapped to all the rows in which this content string is present.


    Do the same inverted indexing for mapping 'Range' to 'RowID' and create another table of Range(int), RowIDs(comma seperated strings).

    Now, you will have a table whose rows will tell which range is present in which row ids.


    Finally, for each query that you have to process, get the corresponding Content and Range row from the inverted index tables and do an intersection of those comma seperated list. You will get your answer.