
Cross product and reading headers in Hadoop


I'm working on a Hadoop document similarity project and I'm stuck on one part. I have a document-term index table stored in a CSV file:

    "", t1,t2,t3,t4,....
    doc1,f11,f12,f13,f14,....
    doc2,f21,f22,f23,f24,....
    doc3,f31,f32,f33,f34,....
    ...

where f12 is the frequency of term 2 (t2) in document 1 (doc1).

I also have a query file containing the queries for which I need to find the nearest (most similar) documents:

    "", t1,t3,t122,t34,....
    q1,f11,f12,f13,f14,....
    q2,f21,f22,f23,f24,....
    q3,f31,f32,f33,f34,....
    ...

The query file may contain terms that differ from those in the document index, so I need the cross product of the two files (term index and queries) in order to compute the distance between each query and the existing documents.

The problem has two parts. First, how do I read the header of each CSV file and store it in a term vector, given that the file will be split across different machines?
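To make the header part concrete: with a split text file, only the split that contains the start of the file sees the header line, so a common workaround is to read the header once up front and make it available to every task (for example through the job configuration or the distributed cache). The Python sketch below is not Hadoop code. It only illustrates the per-line parsing, assuming the header line has already been obtained. The function names `parse_header` and `row_to_sparse` are hypothetical, and dropping zero frequencies is an assumption made to keep the vectors sparse.

```python
import csv

def parse_header(header_line):
    """Return the term names from a header row such as '"", t1,t2,t3'."""
    cells = next(csv.reader([header_line], skipinitialspace=True))
    # The first cell is the empty corner cell, so skip it.
    return [cell.strip() for cell in cells[1:]]

def row_to_sparse(terms, data_line):
    """Turn a data row into (row_id, {term: frequency}) using the header terms."""
    cells = next(csv.reader([data_line], skipinitialspace=True))
    row_id, values = cells[0].strip(), cells[1:]
    # Keep only non-zero frequencies (assumption: zeros carry no information).
    vector = {t: float(v) for t, v in zip(terms, values) if float(v) != 0.0}
    return row_id, vector

terms = parse_header('"", t1,t2,t3')
doc_id, vector = row_to_sparse(terms, "doc1,3,0,1")
```

Keying each vector by term name rather than by column position means the document file and the query file no longer need to share a column layout.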

Second, how do I compute the cross product of the two files so that I can measure similarity? I think this means creating a new representation that contains all possible terms (dimensions) from both files.
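As a rough illustration of the second part, the sketch below pairs every query with every document and scores each pair. It assumes cosine similarity as the measure (the question does not name one) and assumes vectors are stored as `{term: frequency}` dictionaries, so a term missing from one side simply counts as zero and the union of all dimensions never has to be materialized. The names `cosine` and `all_pairs` are hypothetical, and this is plain in-memory Python rather than a MapReduce job.

```python
import math
from itertools import product

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def all_pairs(queries, docs):
    """Cross product: one (query_id, doc_id, score) triple per query/document pair."""
    return [
        (q_id, d_id, cosine(q_vec, d_vec))
        for (q_id, q_vec), (d_id, d_vec) in product(queries.items(), docs.items())
    ]

docs = {"doc1": {"t1": 1.0, "t2": 1.0}, "doc2": {"t3": 2.0}}
queries = {"q1": {"t1": 1.0, "t122": 1.0}}
pairs = all_pairs(queries, docs)
```

Here `q1` shares only `t1` with `doc1` and nothing with `doc2`, so the second pair scores zero even though `t122` never appears in the document index.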

I'm planning to write a k-nearest-neighbour algorithm to find the similar documents. Which tool or tools should I use: Pig, Hive, or Mahout?
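For reference, the k-nearest-neighbour step itself is small once every query/document pair has a score. The sketch below assumes the input is a list of `(query_id, doc_id, score)` triples where a higher score means more similar. The function name `k_nearest` is hypothetical, and the grouping by query roughly corresponds to what a reducer keyed on the query id would do.

```python
import heapq
from collections import defaultdict

def k_nearest(scored_pairs, k):
    """Return {query_id: [doc_id, ...]} with the k highest-scoring docs per query."""
    by_query = defaultdict(list)
    for query_id, doc_id, score in scored_pairs:
        # Store score first so tuples sort by similarity.
        by_query[query_id].append((score, doc_id))
    return {
        query_id: [doc_id for _, doc_id in heapq.nlargest(k, candidates)]
        for query_id, candidates in by_query.items()
    }

scored = [
    ("q1", "doc1", 0.5),
    ("q1", "doc2", 0.0),
    ("q1", "doc3", 0.9),
    ("q2", "doc1", 0.1),
]
nearest = k_nearest(scored, 2)
```

If the scores are distances rather than similarities, `heapq.nsmallest` would be the matching call.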


Solution

  • The book MapReduce Design Patterns covers the Cartesian product pattern separately, with source code provided.