How to find multiple duplicates in a PodioItemCollection response?


I have an array of students who have enrolled on courses. It contains multiple duplicates, and there should be only one entry per student per course.

Example array:

'item_id'=> 1, 'student'=> 'Bob', 'course'=> 'Learn Piano', 'address'=>''
'item_id'=> 2, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=> 'foo street'
'item_id'=> 3, 'student'=> 'Bob', 'course'=> 'Learn Guitar', 'address'=>''
'item_id'=> 4, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=>''
'item_id'=> 5, 'student'=> 'Bob', 'course'=> 'Learn Guitar', 'address'=> 'bla bla street'
'item_id'=> 6, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=>''
'item_id'=> 7, 'student'=> 'John', 'course'=> 'Learn Guitar', 'address'=>''

Data is accessed via API (otherwise this whole thing would be a simple SQL query!).

The raw data looks like below:

object(PodioItemCollection)#287 (5) { ["filtered"]=> int(45639) ["total"]=> int(45639) ["items"]=> NULL ["__items":"PodioCollection":private]=> array(10) { [0]=> object(PodioItem)#3 (5) { ["__attributes":"PodioObject":private]=> array(16) { ["item_id"]=> int(319357433) ["external_id"]=> NULL ["title"]=> string(12) "Foo Bar" ["link"]=> string(71) "https://podio.com/foo/enrolments/apps/applications/items/123" ["rights"]=> array(11) ...

The challenge is that I can't just use array_unique or similar because I need to:

  1. Find all the duplicates for a student + course
  2. Evaluate the found duplicates against each other and retain the item with the most supplementary information (or merge them)
  3. Obtain the "item_id" of each unneeded duplicate and use the API to delete those items.
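
The three steps above can be sketched in PHP as follows, assuming each API item has already been flattened to a plain array with the keys from the example. The names `completeness` and `splitDuplicates`, and the rule "the record with the most non-empty fields wins", are illustrative assumptions and not part of the Podio API:

```php
<?php
// Hypothetical helper: number of non-empty fields in a flattened record.
function completeness(array $record): int
{
    return count(array_filter($record, function ($value) {
        return $value !== '' && $value !== null;
    }));
}

// Groups records by student + course, keeps the most complete record of
// each group, and collects the item_ids of all other records for deletion.
function splitDuplicates(array $records): array
{
    $keep   = array(); // best record so far, keyed by "student:course"
    $delete = array(); // item_ids that can be removed via the API

    foreach ($records as $record) {
        $key = $record['student'] . ':' . $record['course'];

        if (!isset($keep[$key])) {
            $keep[$key] = $record;                  // first record of the group
        } elseif (completeness($record) > completeness($keep[$key])) {
            $delete[]   = $keep[$key]['item_id'];   // old record loses
            $keep[$key] = $record;
        } else {
            $delete[] = $record['item_id'];         // new record loses
        }
    }

    return array('keep' => array_values($keep), 'delete' => $delete);
}
```

Only one pass over the data is needed, and the only state held is one record per student + course pair plus the list of IDs to delete.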

Further constraints:

  • I have no control over the API.
  • There are 44,000 records
  • There could be as many as 100 duplicates per person + course
  • The API returns a nested hierarchy of objects, so 44,000 records use 27GB of RAM (the server has 144GB to play with), and yes, PHP's memory_limit is set to a ridiculous level! This is a one-off project, and the server settings will be restored afterwards.
  • Because of the large RAM usage, functions such as array_intersect are a less attractive choice.
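
Given the memory constraint, one option is to avoid holding the full object hierarchy at all: fetch one page at a time, reduce each item to the few fields needed for comparison, and let the page go. A minimal sketch, assuming a hypothetical `$fetchPage($limit, $offset)` callable that wraps the real API request and a hypothetical `$flatten` callable that turns one API item into a small array. Neither is a Podio function, and the page size is an arbitrary choice:

```php
<?php
// Yields slim records page by page, so only one page of full API objects
// is alive at any time. $fetchPage and $flatten are hypothetical callables
// supplied by the caller.
function slimRecords(callable $fetchPage, callable $flatten, int $pageSize = 500): Generator
{
    $offset = 0;
    do {
        $page = $fetchPage($pageSize, $offset); // array of full API items
        foreach ($page as $item) {
            yield $flatten($item);              // keep only the needed fields
        }
        $count   = count($page);
        $offset += $count;
        unset($page);                           // release the heavy objects
    } while ($count === $pageSize);             // a short page means the end
}
```

The generator can be consumed with a plain `foreach`, so the de-duplication loop never sees more than one flattened record at a time.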

The final output should be:

    'item_id'=> 1, 'student'=> 'Bob', 'course'=> 'Learn Piano', 'address'=>''
    'item_id'=> 2, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=> 'foo street'
    'item_id'=> 5, 'student'=> 'Bob', 'course'=> 'Learn Guitar', 'address'=> 'bla bla street'
    'item_id'=> 7, 'student'=> 'John', 'course'=> 'Learn Guitar', 'address'=>''

But I also need access to item_ids 3, 4 and 6 so I can call a delete routine via the API.

Any ideas how to tackle this multi-duplicate mess?


Solution

  • The following function will do the job for you:

    $apiData = array(
       array('item_id'=> 1, 'student'=> 'Bob', 'course'=> 'Learn Piano', 'address'=>''),
       array('item_id'=> 2, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=> 'foo street'),
       array('item_id'=> 3, 'student'=> 'Bob', 'course'=> 'Learn Guitar', 'address'=>''),
       array('item_id'=> 4, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=>''),
       array('item_id'=> 5, 'student'=> 'Bob', 'course'=> 'Learn Guitar', 'address'=> 'bla bla street'),
       array('item_id'=> 6, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=>''),
       array('item_id'=> 7, 'student'=> 'John', 'course'=> 'Learn Guitar', 'address'=>'')
    );
    
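    function resolveDuplicate($apiData = null)
    {
      if (!$apiData) return false;
    
      $newArr  = array(); // record kept so far, keyed by "student:course"
      $itemIds = array(); // item_ids of the duplicates to delete
    
      foreach ($apiData as $arr) {
        $key = $arr['student'] . ':' . $arr['course'];
        if (!isset($newArr[$key])) {
          // First record seen for this student + course.
          $newArr[$key] = $arr;
        }
        elseif (!$newArr[$key]['address']) {
          // The kept record has no address: replace it and mark it as a duplicate.
          $itemIds[]    = $newArr[$key]['item_id'];
          $newArr[$key] = $arr;
        }
        else {
          // The kept record already has an address: the new one is the duplicate.
          $itemIds[] = $arr['item_id'];
        }
      }
    
      $result = array();
      $result['student']    = array_values($newArr); // re-index from 0
      $result['duplicates'] = $itemIds;
      return $result;
    }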
    
    $res = resolveDuplicate($apiData);
    echo '<pre>';
    print_r($res);
    

    Output

    Array
    (
        [student] => Array
            (
                [0] => Array
                    (
                        [item_id] => 1
                        [student] => Bob
                        [course] => Learn Piano
                        [address] => 
                    )
    
                [1] => Array
                    (
                        [item_id] => 2
                        [student] => Sam
                        [course] => Learn Piano
                        [address] => foo street
                    )
    
                [2] => Array
                    (
                        [item_id] => 5
                        [student] => Bob
                        [course] => Learn Guitar
                        [address] => bla bla street
                    )
    
                [3] => Array
                    (
                        [item_id] => 7
                        [student] => John
                        [course] => Learn Guitar
                        [address] => 
                    )
    
            )
    
        [duplicates] => Array
            (
                [0] => 4
                [1] => 3
                [2] => 6
            )
    
    )
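
The question also allows merging duplicates instead of picking one. A sketch of that variant, assuming the same flattened arrays as in the example data; `mergeDuplicates` is an illustrative name, and the rule "the first non-empty value wins" is an assumption rather than something the question specifies:

```php
<?php
// Keeps the first record of each student + course pair and fills its empty
// fields from later duplicates. All later records are reported as duplicates.
function mergeDuplicates(array $records): array
{
    $merged  = array(); // retained record per "student:course"
    $itemIds = array(); // item_ids of the records folded into another one

    foreach ($records as $record) {
        $key = $record['student'] . ':' . $record['course'];

        if (!isset($merged[$key])) {
            $merged[$key] = $record;
            continue;
        }

        // Copy over any field that is still empty in the retained record.
        foreach ($record as $field => $value) {
            if ($field !== 'item_id' && $value !== ''
                && ($merged[$key][$field] ?? '') === '') {
                $merged[$key][$field] = $value;
            }
        }
        $itemIds[] = $record['item_id'];
    }

    return array('student' => array_values($merged), 'duplicates' => $itemIds);
}
```

With this approach the retained item may differ from what is stored remotely, so it would presumably have to be written back through the API before the duplicates are deleted; that step is not shown here.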