Search code examples
ruby-on-railsrubydatabasedata-analysis

Group records for data analysis in Rails


I have two tables connected with habtm relation (through a table).

Table1
id  : integer
name: string

Table2
id  : integer
name: string

Table3
id       : integer
table1_id: integer
table2_id: integer

I need to group Table1 records by simmilar records from Table2. Example:

userx = Table1.create()
user1.table2_ids = 3, 14, 15
user2.table2_ids = 3, 14, 15, 16
user3.table2_ids = 3, 14, 16
user4.table2_ids = 2, 5, 7
user5.table2_ids = 3, 5

Result of grouping that I want is something like

=> [ [ [1,2], [3, 14, 15] ], [ [2,3], [3,14, 16] ], [ [ 1, 2, 3, 5], [3] ]  ]

Where first array is an user ids second is table2_ids. I there any possible SQL solution or I need to create some kind of algorithm ?

Updated: Ok, I have a code that is working like I've said. Maybe someone who can help me will find it useful to understand my idea.

def self.compare
    hash = {}
    Table1.find_each do |table_record|
      Table1.find_each do |another_table_record|
        if table_record != another_table_record
          results = table_record.table2_ids & another_table_record.table2_ids
          hash["#{table_record.id}_#{another_table_record.id}"] = results if !results.empty?
        end
      end
    end
    #hash = hash.delete_if{|k,v| v.empty?}
    hash.sort_by{|k,v| v.count}.to_h
  end

But I can bet that you can imagine how long does it takes to show me an output. For my 500 Table1 records it's something near 1-2 minutes. If I will have more, time will be increased in progression, so I need some elegant solution or SQL query.


Solution

  • Table1.find_each do |table_record|
      Table1.find_each do |another_table_record|
        ...
    

    Above codes have performance issue that you have to query database N*N times, which could be optimized down to one single query.

    # Query table3, constructing the data useful to us
    # { table1_id: [table2_ids], ... }
    records = Table3.all.group_by { |t| t.table1_id }.map { |t1_id, t3_records|
        [t1_id, t3_records.map(&:table2_id)]
      }.to_h
    

    Then you could do exactly the same thing to records to get the final result hash.

    UPDATE:

    @AKovtunov You miss understood me. My code is the first step. With records, which have {t1_id: t2_ids} hash, you could do sth like this:

    hash = {}
    records.each do |t1_id, t2_ids|
      records.each do |tt1_id, tt2_ids|
        if t1_id != tt1_id
          inter = t2_ids & tt2_ids
          hash["#{t1_id}_#{tt1_id}"] = inter if !inter.empty?
        end
      end
    end