Search code examples
algorithmmatlabinner-joinpseudocodematlab-table

Merge the content of two tables (looking for Matlab or Pseudo Code)


THIS QUESTION IS NOT ONLY FOR MATLAB USERS - If you know an answer to the problem in PSEUDOCODE, then feel free to leave your answer as well!


I have two tables Ta and Tb that have different number of rows and different number of columns. The content is all cell text, but maybe in the future it could also contain cell number.

I want to merge the content of these tables together under the following set of rules:

  • Take the value of Ta(i,j) if Tb(i*,j*) is empty and vice versa.
  • If both are available, then take the value of Ta(i,j) (and optionally, check whether they are the same).

The tricky part however is that we do not have unique row keys, we only have unique column keys. Note above that I make a distinction between i* and i. The reason is that the row in Ta can be at a different index than Tb, same holds for the columns j* and j. The implications are:

  • that we first need to identify which row for Ta corresponds to the row of Tb and vice versa. We can do this by trying to crossmatch any of the columns that the tables share in common. However, we might not find a match (in that case we do not merge a row with another one).

Question

How can we merge the content of these two tables together in the most efficient way?


Here are some resources to explain the question in more detail:

1. Matlab example to play with:

Ta = cell2table({...
     'a1', 'b1', 'c1'; ...
     'a2', 'b2', 'c2'}, ...
      'VariableNames', {'A','B', 'C'})
Tb = cell2table({...
     'b2*', 'c2', 'd2'; ...
     'b3', 'c3', 'd3'; ...
     'b4', 'c4', 'd4'}, ...
      'VariableNames', {'B','C', 'D'})

The resulting table Tc should be something like this:

Tc = cell2table({...
    'a1' 'b1' 'c1'   ''; ...
    'a2' 'b2' 'c2' 'd2'; ...
    ''   'b3' 'c3' 'd3'; ...
    ''   'b4' 'c4' 'd4'}, ...
     'VariableNames', {'A', 'B','C', 'D'})

2. A possible first step

I tried the following:

Tc = outerjoin(Ta, Tb, 'MergeKeys', true)

Which works smooth, but the problem is that it lacks the stacking of rows that seem similar. E.g. the above command produces:

 A        B       C       D  
____    _____    ____    ____
''      'b2*'    'c2'    'd2'
''      'b3'     'c3'    'd3'
''      'b4'     'c4'    'd4'
'a1'    'b1'     'c1'    ''  
'a2'    'b2'     'c2'    '' 

Here the rows

''      'b2*'    'c2'    'd2'
'a2'    'b2'     'c2'    '' 

Should have been merged into one:

'a2'    'b2'     'c2'    'd2' 

So we need one more step to stack those two together?


3. Example of a hurdle

If we have something like:

Ta = 
     A        B       C       
    ____    _____    ____
    'a1'    'b1'     'c1' 
    'a2'    'b2'     'c2'

Tb = 
     A        B       C       
    ____    _____    ____
    'a1'    'b2'     'c3' 

then the question arises whether the row in b should be merged with row 1 or row 2 of a or should all rows be merged or just put as a separate row? An idea on how to handle these type of situations would be nice as well.


Solution

  • Here is a conceptual answer, that could get you on the way:

    1. Define a 'scoring function' that tells you per row of Tb how good it matches a row in Ta.
    2. Fill Tc with Ta
    3. For each row in Ta, determine the best match with Tb. If the match quality is above your criterium, define the best match match to be a succesfull match.
    4. If a succesfull match was found, 'consume' it (use the info from Tb to enrich the corresponding row in Tc where needed)
    5. Keep going till you reach the end of Ta, whatever has not been consumed from Tb can now be 'appended' to Tc.

    Room for improvement:

    Note on choice of matches

    Play around with consuming Ta instead of Tb, or use a more complex heuristic to determine the consumption order (e.g. calculate all 'distances' and optimize the matching based on a cost function).

    Note that these improvements are only neccesary if you get a lot of false positives with your matches in the basic solution.

    Note on definition of match quality

    I would recommend you to start very simple with this, for instance if you have 4 fields, simply count how many fields match, or whether all nonempty fields match.

    If you want to go further, consider evaluating how far the values are apart (e.g. mse) or how far the texts are apart (e.g. levensteihn distance).