I have two Strings that I am checking for specific common words. I already have the semantic scores, but they are irrelevant in this case because these words are technical abbreviations and carry special emphasis. The more common words the two strings share, the higher the score and the closer they are.
There are many ways of going about this. So far I have thought of two.
1) I create two ArrayLists from the words of the strings. I also have a separate set of keywords that I search for in both ArrayLists. If a keyword exists in both, I add +1 to the score.
Then I can have multiple conditions like:
if (firstString.contains(keyWord) && secondString.contains(keyWord))
    score += 1;
if (firstString.contains(anotherKeyWord) && secondString.contains(anotherKeyWord))
    score += 1;
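Rather than one `if` per keyword, the same idea can be written as a loop over the keyword list. A minimal sketch (the class and method names are illustrative, not from the original post):

```java
import java.util.Arrays;
import java.util.List;

public class ContainsScore {
    // Add +1 for every keyword that appears in both strings.
    static int score(String first, String second, List<String> keywords) {
        int score = 0;
        for (String keyWord : keywords) {
            if (first.contains(keyWord) && second.contains(keyWord)) {
                score++;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("ACLS", "ASHD", "CXR");
        System.out.println(score("Patient needs ACLS and CXR", "Order a CXR first", keywords));
    }
}
```

Note that `contains()` matches substrings, so a keyword like "CXR" would also match inside a longer token; splitting into whole words (as in the accepted answer) avoids that.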
2) Take the two strings and do a regex search:
if (firstString.matches("(.*)someExpression(.*)") && secondString.matches("(.*)someExpression(.*)"))
    score += 1;
if (firstString.matches("(.*)someOtherExpression(.*)") && secondString.matches("(.*)someOtherExpression(.*)"))
    score += 1;
Are there other, better ways of doing this? I am more inclined to use regex now; it looks like a pretty efficient way of doing this.
Basically, what I am doing is trying to cluster similar sentences by grouping those that contain abbreviations such as "ACLS", "ASHD", and "CXR" (common medical terms), since I know those sentences talk primarily about those issues. Then I use the semantic scores to group the sentences that contain these words. Is this the wrong approach? :/
Thank you :)
This really depends on how efficient you want your algorithm to be. If I had to choose between the two approaches you suggest, I'd go with the simple contains() check. Regular expressions are good for matching patterns with variations; they are overkill for the exact-match scenario you have here. At best, the time spent compiling all the different regexes will make them slower than the simple contains() approach.
However, there are faster ways. For example, you can split each string into its constituent words and add them to a HashSet (a set implemented as a hash table). Then you use the set-intersection operation (worst case O(n)) to get the common words; the result is again a set. Finally, you check whether these common words appear in your list of known keywords (which can also be a hash table) and increase the score accordingly. With this approach you skip all the string matching of your proposed approaches.
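The steps above can be sketched in Java using HashSet and retainAll() for the intersection (the class name and sample keyword set are illustrative):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CommonKeywordScore {
    // Hypothetical keyword set; in practice this would be the list of known abbreviations.
    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList("ACLS", "ASHD", "CXR"));

    static int score(String first, String second) {
        // Split each sentence into words and store them in hash sets.
        Set<String> a = new HashSet<>(Arrays.asList(first.split("\\s+")));
        Set<String> b = new HashSet<>(Arrays.asList(second.split("\\s+")));

        // Intersect: keep only the words present in both sentences.
        a.retainAll(b);
        // Keep only the known keywords among the common words.
        a.retainAll(KEYWORDS);
        // One point per shared keyword.
        return a.size();
    }

    public static void main(String[] args) {
        System.out.println(score("Patient needs ACLS and CXR", "Order a CXR before ACLS"));
    }
}
```

Each sentence is scanned once to build its set, and lookups are O(1) on average, so no per-keyword scan of the strings is needed.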