I have duplicate profiles due to switching of the identifier in the code. I would like to merge the duplicate profiles now and also merge the events / activity feed.
I got the API working and by calling
deduplicate_people(prop_to_match='$email',merge_props=True,case_sensitive=False,backup=True,backup_file=None)
Duplicates are in fact removed, but the events / activity feed is not merged. So I'd loose many events.
Is there a way to remove duplicates and merging events / activity feed at the same time?
Duplicates happen because some persons use ID and others email as distinct_id due to the change of identifier. The events are referenced by that ID or email to the corresponding person.
So here is what I ended up doing to re-create the identity mapping for people and their events:
I used Mixpanel's API (export_people / export_events) to create a backup of people and events. I wrote a script that creates a mapping "distinct_id <-> email" for people that use an actual ID as distinct_id and not an email (each person has an $email field regardless of the content of the $distinct_id).
Then I went over all exported events. For each event that had an ID as distinct_id I used the mapping to change that distinct_id to email. Updated events were saved in a JSON file. Thus creating the reference from events to person using email as distinct_id -- the events that got lost otherwise.
Then I went ahead and used the de-duplicate API from Mixpanel to delete all duplicates -- thus loosing some events. Now I imported the events from the step before, which gave me back those missing events.
Three open questions to consider before using this approach:
I believe events are not actually deleted on deduplication. So by importing them again there are probably duplicate events in the system that are just not referenced to a person and that may show up at some point.
the deduplication by $email did keep the people that use email as distinct_id and removed the ones with the actual ID. I don't know if this is true every time or may have been a coincidence. My approach will fail for persons that still use ID as distinct_id.
I suppose it's generally discouraged to hack around the distinct_id like that, because making a mistake may result in data loss. So make sure to get it right..