
Combine XML files based on entry similarity


I need to combine differently structured XML files using PHP. What I am doing is:

  1. Read first XML file using simplexml_load_file()
  2. Reformat the elements into a new structure using the SimpleXMLElement class
  3. Do the same for the other file, appending to the first SimpleXMLElement instance
  4. Save the newly combined XML file.
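The four steps above can be sketched roughly as follows. This is only a minimal illustration: the sample XML, the element names (products, items, catalog, entry) and the attribute names are assumptions, since the real file structures are not shown here.

```php
<?php
// Hypothetical sample data; the real files would be read with simplexml_load_file().
$xmlA = '<products><product><name>Lenovo G50-70 CoreI5</name><price>500</price></product></products>';
$xmlB = '<items><item title="Lenovo G5070 I5" cost="480"/></items>';

$first  = simplexml_load_string($xmlA);
$second = simplexml_load_string($xmlB);

// New, unified structure (the root and child names are assumptions).
$combined = new SimpleXMLElement('<catalog/>');

// Step 2: reformat the entries of the first file.
foreach ($first->product as $product) {
    $entry = $combined->addChild('entry');
    $entry->name  = (string) $product->name;   // property assignment escapes "&" safely
    $entry->price = (string) $product->price;
}

// Step 3: same for the second file, appending to the same instance.
foreach ($second->item as $item) {
    $entry = $combined->addChild('entry');
    $entry->name  = (string) $item['title'];
    $entry->price = (string) $item['cost'];
}

// Step 4: asXML() returns the document as a string; pass a filename to write it to disk.
$output = $combined->asXML();
```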

So far so good. The tricky part is that the first file has approx. 3000 entries and the second has 5000. Nearly 2000 of these entries are actually the same; only a couple of letters differ. For example, one file has "Lenovo G50-70 CoreI5" and the other has "Lenovo G5070 I5".

The question is: how can I match an entry in the first file with the equivalent entry in the second file, so that the new combined file contains only one entry for it?

I am using both PHP's similar_text() function and SmithWatermanGotoh to calculate similarity, and the example above matches with a score of 86%, which is enough for me. But iterating over all entries of the other file to match a single entry seems unwise and resource-consuming to me, because it means approx. 7 MB of files loaded into memory and a minimum of 15,000 iterations each time I save a new updated file.
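For reference, the similar_text() part of that score can be reproduced with a small helper like the one below. The helper name similarity_percent(), the lowercasing and the 80% threshold are assumptions made for illustration; SmithWatermanGotoh is not part of core PHP and is left out.

```php
<?php
// Hypothetical helper: returns the similar_text() score as a percentage (0..100).
function similarity_percent(string $a, string $b): float
{
    // Lowercasing first is an assumption, so that "CoreI5" and "corei5" count as equal.
    similar_text(strtolower($a), strtolower($b), $percent);
    return $percent;
}

$score = similarity_percent('Lenovo G50-70 CoreI5', 'Lenovo G5070 I5'); // roughly 85.7

// The threshold is an assumption; pick whatever cut-off works for the data.
$isMatch = $score >= 80.0;
```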

I am considering inserting all entries into a database table and using Sphinx Search to match entries, but I am not sure if it would really help enough.


Solution

  • The best approach I could see is using a custom callback with the array_uintersect() function. It works in steps like this:

    1- Write a comparison function that calculates the similarity. Check the array_uintersect() manual on php.net to get an idea of how this callback needs to be written. Say its name is find_similar_entries().

    2- Collect the entries from the two XML files into two arrays respectively. (For a quick way, do a json_encode() first and then a json_decode() back.)

    3- Have the intersection function find the similar entries, like: $similar_products = array_uintersect($xml_array1, $xml_array2, 'find_similar_entries');

    4- Now you have similar entries collected in one array.

    5- Call array_diff() to remove similar entries from the original arrays.

    6- Finally, combine all three arrays into a new XML structure as you wish, using the SimpleXMLElement class.
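    The steps above might look roughly like the sketch below. It works on plain arrays of product names rather than full XML entries, and several details are assumptions: the 80% threshold, the strcmp() fallback (the callback has to return an integer that is zero for "equal" and non-zero otherwise), the second array_uintersect() call in the other direction (so that array_diff() can also remove the matched names from the second array), and the catalog/entry element names.

```php
<?php
// Step 1: comparison callback. Returns 0 when two names are "similar enough",
// otherwise falls back to an ordinary string comparison (an assumption).
function find_similar_entries(string $a, string $b): int
{
    similar_text(strtolower($a), strtolower($b), $percent);
    if ($percent >= 80.0) {   // assumed threshold
        return 0;
    }
    return strcmp($a, $b);
}

// Step 2: entries collected from each file (hypothetical sample names).
$xml_array1 = ['Lenovo G50-70 CoreI5', 'Dell Inspiron 15'];
$xml_array2 = ['Lenovo G5070 I5', 'HP Pavilion 14'];

// Steps 3-4: entries of the first array that have a similar entry in the second.
$similar_products = array_uintersect($xml_array1, $xml_array2, 'find_similar_entries');

// Same in the other direction, to learn which second-file entries were matched.
$similar_in_second = array_uintersect($xml_array2, $xml_array1, 'find_similar_entries');

// Step 5: remove the matched entries from the original arrays.
$only_first  = array_diff($xml_array1, $similar_products);
$only_second = array_diff($xml_array2, $similar_in_second);

// Step 6: combine the three arrays into a new XML structure.
$combined = new SimpleXMLElement('<catalog/>');
foreach (array_merge($similar_products, $only_first, $only_second) as $name) {
    $combined->addChild('entry')->name = $name;
}
```

    Note that a similarity callback is not a strict ordering, so on larger or messier data the result of array_uintersect() may deserve a manual spot check.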

    Note 1: I used similar_text() and SmithWatermanGotoh to calculate the similarity, and I can say they work well together. But when it comes to very close product names that differ from each other by only a few characters, they end up being treated as "identical". There is nothing you can do about this except extract the distinguishing words from the strings, such as the model name in my case.
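    One possible way to extract such a distinguishing word is sketched below. It assumes that the first word containing a digit is the model name, which will not hold for every catalog; model_token() and same_model() are hypothetical helper names.

```php
<?php
// Hypothetical helper: returns the first word that contains a digit,
// lowercased and stripped of punctuation, or null if there is none.
function model_token(string $name): ?string
{
    foreach (preg_split('/\s+/', strtolower(trim($name))) as $word) {
        $clean = preg_replace('/[^a-z0-9]/', '', $word);
        if (preg_match('/\d/', $clean)) {
            return $clean;
        }
    }
    return null;
}

// Two names are treated as the same product only if their model tokens agree.
function same_model(string $a, string $b): bool
{
    $model = model_token($a);
    return $model !== null && $model === model_token($b);
}
```

    A check like same_model() could be combined with the similarity score, so that "Lenovo G50-70" and "Lenovo G50-80" are kept apart even though their names are almost identical.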

    Note 2: This method works as expected, but I think PHP's intersection functions have a bug that makes them very slow. I created a bug report for it. The intersection does not only compare the elements of the two arrays cross-wise; it also compares each array's own elements with one another. This seems illogical to me, because an intersection can only be calculated by comparing at least two parties, so comparing one array internally is not actually "intersection". This is why, if you have large files, your script will die if you just run this straightforwardly. Maybe you can do it chunk by chunk.
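    A chunked variant that compares strictly cross-wise could look like the sketch below. This is not array_uintersect() but a plain nested loop swapped in for it; the function name match_cross_wise(), the threshold and the chunk size are assumptions. As far as I know, the extra comparisons inside each array come from array_uintersect() sorting its inputs with the callback before intersecting them, which a direct loop avoids.

```php
<?php
// Hypothetical helper: for each name in $first, find the best match in $second
// whose similar_text() score reaches the threshold. Returns [firstName => secondName].
function match_cross_wise(array $first, array $second, float $threshold = 80.0, int $chunkSize = 500): array
{
    $matches = [];
    // Work through the first array chunk by chunk; each chunk could also be
    // processed in a separate run if time or memory limits are a concern.
    foreach (array_chunk($first, $chunkSize) as $chunk) {
        foreach ($chunk as $a) {
            $bestScore = $threshold;
            foreach ($second as $b) {
                similar_text(strtolower($a), strtolower($b), $percent);
                if ($percent >= $bestScore) {
                    $bestScore   = $percent;
                    $matches[$a] = $b;
                }
            }
        }
    }
    return $matches;
}

$matches = match_cross_wise(
    ['Lenovo G50-70 CoreI5', 'Dell Inspiron 15'],
    ['Lenovo G5070 I5', 'HP Pavilion 14']
);
```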