hadoop cassandra bigdata datastax-enterprise brisk

Hadoop and Cassandra to Compare 2 Rows


I have two rows in a Cassandra column family and want to compare the values of columns that have the same column name, e.g.:

CF: User

Key: Columns:
......................................................

K1: {Col1: "Andy" V1: "100"} {Col2: "Tom" V2: "100"}

K2: {Col1: "Andy" V1: "120"} {Col2: "Tom" V2: "90"}

Now I want to compute the difference between the K2 columns and the K1 columns, to get this result in Cassandra:

Key: Columns:
.........................................................................

K1: {Col1: "Andy" V1: "100"} {Col2: "Tom" V2: "100"}

K2: {Col1: "Andy" V1: "120" Diff: 20} {Col2: "Tom" V2: "90" Diff: -10}

At first I wanted to code this with Hadoop, but I see a problem: it seems I can't define two keys for a map process?

Hadoop was the choice because it must be a scalable solution.

I hope someone has a tip for me.

BG, Danny


Solution

  • I don't understand which row is the base of the subtraction: K1[V1]-K2[V1] or vice versa?

    OK, let's say the row with the most recent timestamp will be the base.

    Your Map step should emit the following (K => V):

    // each value is a WritableComparable object to allow sorting by timestamp
    
    "Andy" => {"key":K1, "value":100, timestamp1} 
    "Tom"  => {"key":K1, "value":100, timestamp2} 
    "Andy" => {"key":K2, "value":120, timestamp3} 
    "Tom"  => {"key":K2, "value":90,  timestamp4} 
    

    The Reduce step will receive, for each key, an array of values sorted by timestamp:

    "Andy" => [ {"key":K1, "value":100, timestamp1},
                {"key":K2, "value":120, timestamp3} ]
    
    "Tom"  => [ {"key":K1, "value":100, timestamp2},
                {"key":K2, "value":90,  timestamp4} ]
    

    Now in the reduce step you can easily perform the subtraction and write the necessary columns, like "diff", to the database.
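
To make the data flow concrete, here is a minimal sketch of the map, shuffle and reduce steps in plain Python, with no Hadoop or Cassandra involved. The `rows` dictionary, the integer timestamps and the function names `map_row`, `shuffle`, `reduce_name` and `run` are all made up for illustration. The sketch assumes "diff" means the newer value minus the older one, which matches the Diff values shown in the question.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the "User" column family:
# row key -> list of (column name, value, timestamp). Timestamps are invented.
rows = {
    "K1": [("Andy", 100, 1), ("Tom", 100, 2)],
    "K2": [("Andy", 120, 3), ("Tom", 90, 4)],
}


def map_row(row_key, columns):
    """Map step: emit (name, record) so all records for one name meet in one reduce call."""
    for name, value, timestamp in columns:
        yield name, {"key": row_key, "value": value, "timestamp": timestamp}


def shuffle(pairs):
    """Stand-in for the shuffle phase: group by name, order each group by timestamp."""
    groups = defaultdict(list)
    for name, record in pairs:
        groups[name].append(record)
    for records in groups.values():
        records.sort(key=lambda r: r["timestamp"])
    return groups


def reduce_name(name, records):
    """Reduce step: subtract the older value from the newer one (assumed direction)."""
    for older, newer in zip(records, records[1:]):
        yield newer["key"], name, newer["value"] - older["value"]


def run(rows):
    """Drive the three steps over the whole input and collect the diffs."""
    pairs = [p for key, cols in rows.items() for p in map_row(key, cols)]
    diffs = []
    for name, records in sorted(shuffle(pairs).items()):
        diffs.extend(reduce_name(name, records))
    return diffs


diffs = run(rows)
for row_key, name, diff in diffs:
    print(row_key, name, diff)
```

Each emitted triple says which row should receive the "diff" column, for which name, and with which value. In a real Hadoop job the values passed to a reducer are not ordered by default, so the timestamp ordering done here in `shuffle` would need either a secondary sort on a composite key or an explicit sort inside the reducer.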