Search code examples
amazon-web-servicesurlmapreducedistributed-computing

Find a common pattern among a set of URLs


I have a file of URLs. The file looks like this

http://www.example.com/images/1
http://www.example.com/images/2
.
.
.
http://www.example.com/images/2000
http://www.example.org/p/q/r/1/s/t
http://www.example.org/p/q/r/2/s/t
http://www.example.org/p/q/r/3/s/t
.
.
.
http://www.example.org/p/q/r/5000/s/t

and so on. The URLs are not sorted. I have just put it out in sorted manner to explain it clearly.

I have to process these URLs such that if there is one word(word between 2 slashes) different between 2 URLs and number of such occurrences is > 1000, I replace that word by *

For example, in the above file, I will have

http://www.example.com/images/*
http://www.example.org/p/q/r/*/s/t

The file size is in hundreds of GBs. Can someone help me out with this?


Solution

  • We have proposed a MapReduce algorithm for this problem in this paper (end of page 3), which is a parallel adaptation of the sequential "infix extraction" algorithm, introduced in another paper (page 4, Algorithm 1).

    Here, I quote the sequential algorithm:

    1. sort URIs
    2. tokenize URIs at all special characters
    3. cluster URIs according to the first n tokens 
    4. for all clusters do
    5.   for all URIs in the cluster do
    6.     for all possible prefixes do
    7.       find the set of (distinct) next tokens T
    8.     end for
    9.   end for
    10.   for all URIs in the cluster do
    11.     set as a prefix the one with the largest |T|
    12.     set as infix the substring following the prefix
    13.   end for
    14. end for
    

    The main idea of the parallel version is that you create the clusters (step 3), by using as map output key the second token of the URI (after http://), and then perform similar operations to steps 4-14) in each reducer, i.e., in each cluster. The source code of our parallel version can be found here.

    After you extract the "infix" of each URI, you can easily replace it with any character you wish, e.g., '*'. Keep in mind that this is an expensive process, which makes sense to be done in MapReduce, only when you have millions+ of URIs, which it seems that you do have.