I have a file of URLs. The file looks like this
http://www.example.com/images/1
http://www.example.com/images/2
...
http://www.example.com/images/2000
http://www.example.org/p/q/r/1/s/t
http://www.example.org/p/q/r/2/s/t
http://www.example.org/p/q/r/3/s/t
...
http://www.example.org/p/q/r/5000/s/t
and so on. The URLs are not actually sorted; I have only shown them sorted here to explain the problem clearly.
I have to process these URLs so that if exactly one word (a word being the text between two slashes) differs between two URLs, and the number of such occurrences is greater than 1000, I replace that word with *.
For example, in the above file, I will have
http://www.example.com/images/*
http://www.example.org/p/q/r/*/s/t
The file size is in hundreds of GBs. Can someone help me out with this?
We have proposed a MapReduce algorithm for this problem in this paper (end of page 3); it is a parallel adaptation of the sequential "infix extraction" algorithm introduced in another paper (page 4, Algorithm 1).
Here, I quote the sequential algorithm (a rough code sketch follows the listing):
1.  sort URIs
2.  tokenize URIs at all special characters
3.  cluster URIs according to the first n tokens
4.  for all clusters do
5.      for all URIs in the cluster do
6.          for all possible prefixes do
7.              find the set of (distinct) next tokens T
8.          end for
9.      end for
10.     for all URIs in the cluster do
11.         set as prefix the one with the largest |T|
12.         set as infix the substring following the prefix
13.     end for
14. end for
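To make the quoted steps concrete, here is a minimal in-memory Python sketch of the sequential algorithm. It is not the paper's implementation: the function name, tokenizing only at '/', clustering by host, and the threshold parameter are my own illustrative assumptions, and it ignores the out-of-core aspect of a hundreds-of-GB file.

```python
from collections import defaultdict
from urllib.parse import urlparse

def extract_infixes(uris, threshold=1000):
    # Steps 1-3: sort, tokenize (here only at '/'), cluster by host.
    clusters = defaultdict(list)
    for uri in sorted(uris):
        p = urlparse(uri)
        tokens = [p.netloc] + [t for t in p.path.split('/') if t]
        clusters[p.netloc].append((p.scheme, tokens))

    results = set()
    for cluster in clusters.values():
        # Steps 5-9: for every possible prefix, collect the set of
        # distinct next tokens T.
        next_tokens = defaultdict(set)  # prefix tuple -> T
        for _, tokens in cluster:
            for i in range(1, len(tokens)):
                next_tokens[tuple(tokens[:i])].add(tokens[i])
        if not next_tokens:  # cluster of bare hosts, nothing to mask
            results.update(s + '://' + '/'.join(t) for s, t in cluster)
            continue
        # Steps 10-13: the prefix is the one with the largest |T|;
        # the token right after it is the infix.
        prefix = max(next_tokens, key=lambda pre: len(next_tokens[pre]))
        mask = len(next_tokens[prefix]) > threshold  # the > 1000 rule
        for scheme, tokens in cluster:
            out = list(tokens)
            i = len(prefix)
            if mask and i < len(out) and tuple(out[:i]) == prefix:
                out[i] = '*'
            results.add(scheme + '://' + '/'.join(out))
    return sorted(results)
```

On the question's example this collapses the 2000 images URIs into the single line http://www.example.com/images/* and the 5000 example.org URIs into http://www.example.org/p/q/r/*/s/t.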
The main idea of the parallel version is that you create the clusters (step 3) by using the second token of the URI (the host, after http://) as the map output key, and then perform operations similar to steps 4-14 in each reducer, i.e., in each cluster. The source code of our parallel version can be found here.
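The shape of that parallel version can be sketched as a Hadoop Streaming mapper/reducer pair. This is not the linked source code: the module name, the tab-separated key format, and the dispatch at the bottom are my assumptions, extract_infixes is the sequential sketch above, and a production reducer would stream each cluster rather than buffer it in memory.

```python
import sys
from urllib.parse import urlparse

from infix import extract_infixes  # the sequential sketch above (hypothetical module name)

def mapper():
    # Emit (host, uri): the second token of the URI becomes the map
    # output key, so each cluster (step 3) ends up in one reduce group.
    for line in sys.stdin:
        uri = line.strip()
        if uri:
            print(urlparse(uri).netloc + '\t' + uri)

def reducer():
    # Hadoop Streaming hands the reducer its input sorted by key, so
    # each host's URIs arrive contiguously; run steps 4-14 per cluster.
    cluster, current = [], None
    for line in sys.stdin:
        key, uri = line.rstrip('\n').split('\t', 1)
        if key != current and cluster:
            print('\n'.join(extract_infixes(cluster)))
            cluster = []
        current = key
        cluster.append(uri)
    if cluster:
        print('\n'.join(extract_infixes(cluster)))

if __name__ == '__main__':
    # Illustrative dispatch: pass 'map' or 'reduce' as the stage name.
    mapper() if sys.argv[1] == 'map' else reducer()
```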
After you extract the "infix" of each URI, you can easily replace it with any character you wish, e.g., '*'. Keep in mind that this is an expensive process, which makes sense to run in MapReduce only when you have millions of URIs or more, which it seems you do.
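For the replacement itself, once you know which path segment is the infix, something as small as this illustrative helper (not part of the paper's code) is enough:

```python
def mask_infix(uri, infix_pos, mask='*'):
    # Replace the infix_pos-th path segment after the host with the mask.
    scheme, rest = uri.split('://', 1)
    parts = rest.split('/')  # parts[0] is the host
    parts[infix_pos + 1] = mask
    return scheme + '://' + '/'.join(parts)

# mask_infix("http://www.example.org/p/q/r/42/s/t", 3)
#   -> "http://www.example.org/p/q/r/*/s/t"
```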