Intelligent web features, algorithms (people you may follow, similar to you ...)

I have 3 main questions about the algorithms in intelligent web (web 2.0)

Here the book I'm reading http://www.amazon.com/Algorithms-Intelligent-Web-Haralambos-Marmanis/dp/1933988665 and I want to learn the algorithms in deeper

1. People You may follow (Twitter)

How can one determine the nearest result to my requests ? Data mining? which algorithms?

2. How you’re connected feature (Linkedin)

Simply algorithm works like that. It draws the path between two nodes let say between Me and the other person is C. Me -> A, B -> A connections -> C . It is not any brute force algorithms or any other like graph algorithms :)

3. Similar to you (Twitter, Facebook) This algorithms is similar to 1. Does it simply work the max(count) friend in common (facebook) or the max(count) follower in Twitter? or any other algorithms they implement? I think the second part is true because running the loop

 dict{count, person}
 for person in contacts:
        dict.add(count(common(person)))
 return dict(max)

is a silly act in every refreshing page.

4. Did you mean (Google) I know that they may implement it with phonetic algorithm http://en.wikipedia.org/wiki/Phonetic_algorithm simply soundex http://en.wikipedia.org/wiki/Soundex and here is the Google VP of Engineering and CIO Douglas Merrill speak http://www.youtube.com/watch?v=syKY8CrHkck#t=22m03s

What about first 3 questions? Any ideas are welcome !

Thanks

Solution

People who you may follow

You can use the factors based calculations:

factorA = getFactorA(); // say double(0.3)
factorB = getFactorB(); // say double(0.6)
factorC = getFactorC(); // say double(0.8)

result = (factorA+factorB+factorC) / 3 // double(0.5666666666666667)
// if result is more than 0.5, you show this person

So say in the case of Twitter, "People who you may follow" can based on the following factors (User A is the user viewing this "People who you may follow" feature, there may be more or less factors):

Relativity between frequent keywords found in User A's and User B's tweets
Relativity between the profile description of both users
Relativity between the location of User A and B
Are people User A is following follows User B?

So where do they compare "People who you may follow" from? The list probably came from a combination of people with high amount of followers (they are probably celebrities, alpha geeks, famous products/services, etc.) and [people whom User A is following] is following.

Basically there's a certain level of data mining to be done here, reading the tweets and bios, calculations. This can be done on a daily or weekly cron job when the server load is least for the day (or maybe done 24/7 on a separate server).

How are you connected

This is probably a smart work here to make you feel that loads of brute force has been done to determine the path. However after some surface research, I find that this is simple:

Say you are User A; User B is your connection; and User C is a connection of User B.

In order for you to visit User C, you need to visit User B's profile first. By visiting User B's profile, the website already save the info indiciating that User A is at User B's profile. So when you visit User C from User B, the website immediately tells you that 'User A -> User B -> User C', ignoring all other possible paths.

This is the max level as at User C, User Acannot go on to look at his connections until User C is User A's connection.

Source: observing LinkedIN

Similar to you

It's the exact same thing as #1 (People you may follow), except that the algorithm reads in a different list of people. The list of people that the algorithm reads in is the people whom you follow.

Did you mean

Well you got it right there, except that Google probably used more than just soundex. There's language translation, word replacement, and many other algorithms used for the case of Google. I can't comment much on this because it will probably get very complex and I am not an expert to handle languages.

If we research a little more into Google's infrastructure, we can find that Google has servers dedicated to Spelling and Translation services. You can get more information on Google platform at http://en.wikipedia.org/wiki/Google_platform.

Conclusion

The key to highly intensified algorithms is caching. Once you cache the result, you don't have to load it every page. Google does it, Stack Overflow does it (on most of the pages with list of questions) and Twitter not surprisingly too!

Basically, algorithms are defined by developers. You may use others' algorithms, but ultimately, you can also create your own.