Search code examples
pythondatabasealgorithmsimilarity

Finding same users in different datasets based on their friends and usernames


Let's say there are two databases where people their usernames and their network (followers and following) are stored. Both databases consist of the following tables:

       Users        |      Relation
------------------------------------------
- id (Primary Key)  | - id (Primary Key)
- username (unique) | - follower_id (Foreign key references users id)
                    | - following_id (Foreign key references users id)

Assuming some of the usernames in both datasets are the same, some usernames don't exist in the other database, and some usernames are different but refer to the same person. Also, their network have similar tendencies but may slightly vary.

Are there any known or suggested ways to do some similarity on the usernames and their network to either correlate them to a profile in the other database or to None, if there isn't any?

I'm working in python, but any pseudo-code, general answer or idea of how to approach this issue would suffice.


Solution

  • First you have to use fuzzy search by username in both databases. Let's call this method List getFuzzyName(id) You will get the list of 'probably the same person names' Next use the same fuzzy search for getFuzzyName(follower_id) and getFuzzyName(following_id) and get some number of matching friends. Here getFuzzyName has O(n^2) complexity without optimization, as you need to compare each friend from db1 with each one from db2. The persons from DB1 and DB2 who has this number bigger then some limit is most probably the same person.