I have a task where I have three arrays A,B,C. All of the contain the same data. For the sake of simplicity lets assume the data is numbers 1 to 5. The data would be in different jumbled sequences. I want to find out among B & C which array has data most similar to A.
Eg:
A = 1,2,3,4,5
B = 1,2,3,5,4
C = 4,1,2,3,5
In this case, it is easy to visually comprehend that B is more similar to A. But it gets more complicated for really jumbled sequences.
Eg:
A = 1,2,3,4,5
B = 5,3,1,4,2
C = 4,1,2,3,5
In this case, I would assume C to be more closer to A. I am thinking that this assumption can be quantified as: How many elements have the same sequence in both arrays? In above example the subsequence of [1,2,3] is the same in both arrays. The second question would be what is the offset difference between the similar subsequence ? In this case it is 1, because the subsequence begins at index 0 for A and index 1 for C.
So the number of elements in a matching sequence and their offsets are what I am thinking to use. I plan on adding a weightage to these two entities (number of elements in matching sequence, and offset difference in their occurrence)
Does this make sense? I only need a rough approximation of similarity and the results do not need to be exact. Are there any formal mathematical or data-structure models that solve this problem?
BTW, the project where I need this implemented is in PHP. Does it have any inbuilt functions like the levenstein model for string difference?
Any suggestions are very welcome!
Well I suppose you can come up with your own algorithm (for instance generate all suffixes and then search for them and then define a scoring procedure) or you could use a well known algorithm like
Smith-Waterman for local alignment or Needleman-Wunsch for global. The advantage of these algorithms is that they are well-understood and give you all the possible alignments (and you can choose the best for your case).