Mixpanel: Merge duplicate people profiles and also merge events


I have duplicate profiles because the identifier was changed in the code. I would like to merge the duplicate profiles now and also merge their events / activity feed.

I got the API working. After calling

deduplicate_people(prop_to_match='$email',merge_props=True,case_sensitive=False,backup=True,backup_file=None)

the duplicates are in fact removed, but the events / activity feed is not merged, so I would lose many events.

Is there a way to remove duplicates and merge the events / activity feed at the same time?


Solution

  • Duplicates happen because, after the identifier change, some people use an ID and others an email as distinct_id. Each event references its person by that ID or email.

    So here is what I ended up doing to re-create the identity mapping for people and their events:

    I used Mixpanel's API (export_people / export_events) to create a backup of people and events. I wrote a script that creates a mapping "distinct_id <-> email" for people that use an actual ID as distinct_id and not an email (each person has an $email field regardless of the content of the $distinct_id).
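The mapping step might look roughly like the sketch below. It assumes the exported profiles have already been loaded as a list of dicts shaped like {"$distinct_id": ..., "$properties": {"$email": ...}}; that layout, the helper names, and the "contains @" email heuristic are assumptions for illustration, not part of the original script.

```python
def looks_like_email(value):
    """Crude heuristic (an assumption): any string containing '@' is an email."""
    return isinstance(value, str) and "@" in value


def build_id_to_email(profiles):
    """Map every non-email distinct_id to the $email stored on the same profile."""
    mapping = {}
    for profile in profiles:
        distinct_id = profile.get("$distinct_id")
        email = profile.get("$properties", {}).get("$email")
        if distinct_id is None or not email:
            continue  # nothing to map without both values
        if looks_like_email(distinct_id):
            continue  # profile is already keyed by email
        mapping[str(distinct_id)] = email
    return mapping


# Hypothetical sample data standing in for the people export.
sample_profiles = [
    {"$distinct_id": "1001", "$properties": {"$email": "ann@example.com"}},
    {"$distinct_id": "ann@example.com", "$properties": {"$email": "ann@example.com"}},
    {"$distinct_id": "1002", "$properties": {}},
]

id_to_email = build_id_to_email(sample_profiles)
print(id_to_email)  # {'1001': 'ann@example.com'}
```

Profiles without an $email are skipped, so their events cannot be remapped later and would need separate handling.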

    Then I went over all exported events. For each event that had an ID as distinct_id, I used the mapping to change that distinct_id to the email. The updated events were saved in a JSON file. This re-creates the reference from those events to the person that uses email as distinct_id, which covers exactly the events that would otherwise be lost.
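A minimal sketch of the event rewrite, assuming each exported event is a dict shaped like {"event": ..., "properties": {"distinct_id": ..., "time": ...}} and that an ID-to-email mapping is available as a plain dict. The function name and the sample data are hypothetical.

```python
import copy
import json


def remap_events(events, id_to_email):
    """Return copies of only those events whose distinct_id is an ID in the mapping,
    with the distinct_id replaced by the corresponding email."""
    updated = []
    for event in events:
        distinct_id = str(event.get("properties", {}).get("distinct_id"))
        if distinct_id in id_to_email:
            new_event = copy.deepcopy(event)  # leave the backup data untouched
            new_event["properties"]["distinct_id"] = id_to_email[distinct_id]
            updated.append(new_event)
    return updated


# Hypothetical sample data standing in for the mapping and the events export.
id_to_email = {"1001": "ann@example.com"}
sample_events = [
    {"event": "Login", "properties": {"distinct_id": "1001", "time": 1500000000}},
    {"event": "Login", "properties": {"distinct_id": "ann@example.com", "time": 1500000100}},
]

updated_events = remap_events(sample_events, id_to_email)
# Serialize the rewritten events; writing this string to a file gives the JSON
# file that is imported after deduplication.
print(json.dumps(updated_events, indent=2))
```

Only the rewritten events are returned, because events already keyed by email stay attached to the surviving profile and do not need to be imported again.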

    Then I used Mixpanel's de-duplicate API to delete all duplicates, thereby losing some events. Finally, I imported the events from the previous step, which gave me back those missing events.

    Three open questions to consider before using this approach:

    1. I believe events are not actually deleted on deduplication. So after importing them again, there are probably duplicate events in the system that no longer reference a person and that may show up at some point.

    2. The deduplication by $email kept the people that use email as distinct_id and removed the ones with the actual ID. I don't know whether this always holds or was a coincidence. My approach will fail for persons that still use an ID as distinct_id.

    3. I suppose it's generally discouraged to hack around the distinct_id like that, because a mistake may result in data loss. So make sure to get it right.
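One way to guard against the second point is to export the people again after deduplication and list any surviving profile that is still keyed by an ID before importing the rewritten events. The sketch below assumes the same profile layout as the export ({"$distinct_id": ..., "$properties": {...}}) and the same "contains @" heuristic for emails; both are assumptions, and the function name is hypothetical.

```python
def profiles_keyed_by_id(profiles):
    """Return the distinct_ids of profiles that do not look like an email."""
    return [
        profile.get("$distinct_id")
        for profile in profiles
        if "@" not in str(profile.get("$distinct_id", ""))
    ]


# Hypothetical people export taken after deduplication.
profiles_after_dedup = [
    {"$distinct_id": "ann@example.com", "$properties": {"$email": "ann@example.com"}},
    {"$distinct_id": "1002", "$properties": {"$email": "bob@example.com"}},
]

leftover = profiles_keyed_by_id(profiles_after_dedup)
print(leftover)  # ['1002']
```

If the list is not empty, events rewritten to the email of such a person would point at a distinct_id that no longer has a profile, so those cases should be resolved before the import.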