Search code examples
phpmysqldata-comparison

How to compare data with varying number of values


I'll try to explain my problem as well as I can.

I have to compare data of same type, saved in a MySql database but containing a varying number of values.

I build my database like this (probably not the best):

--
-- Table structure for table `amount`
--

CREATE TABLE `amount` (
  `id` tinyint(1) UNSIGNED NOT NULL,
  `value` tinyint(1) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

--
-- Dumping data for table `amount`
--

INSERT INTO `amount` (`id`, `value`) VALUES
(34, 1),
(22, 2),
(30, 6),
(21, 7),
(9, 8),
(17, 9),
(10, 10),
(15, 11),
(3, 12),
(4, 13),
(8, 14),
(5, 15),
(16, 16),
(13, 17),
(6, 18),
(20, 19),
(7, 20),
(23, 21),
(18, 22),
(19, 23),
(24, 24),
(14, 25),
(25, 26),
(26, 27),
(28, 28),
(29, 29),
(11, 30),
(27, 31),
(12, 32),
(31, 33),
(32, 35),
(33, 36),
(2, 98),
(1, 99);

-- --------------------------------------------------------

--
-- Table structure for table `mark`
--

CREATE TABLE `mark` (
  `id` tinyint(1) UNSIGNED NOT NULL,
  `name` varchar(16) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

--
-- Dumping data for table `mark`
--

INSERT INTO `mark` (`id`, `name`) VALUES
(1, 'A'),
(2, 'B'),
(3, 'C'),
(4, 'D'),
(5, 'E'),
(6, 'F'),
(7, 'G'),
(8, 'H'),
(9, 'I'),
(10, 'J')
(11, 'K')
(12, 'L')
(13, 'M')
(14, 'N')
(15, 'O');

-- --------------------------------------------------------

--
-- Table structure for table `profile`
--

CREATE TABLE `profile` (
  `id` smallint(2) UNSIGNED NOT NULL,
  `run` smallint(2) NOT NULL,
  `deleted` datetime DEFAULT NULL,
  `created` datetime NOT NULL,
  `validated` datetime DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- --------------------------------------------------------

--
-- Table structure for table `profile_mark`
--

CREATE TABLE `profile_mark` (
  `id` int(11) NOT NULL,
  `id_profile` smallint(2) UNSIGNED NOT NULL,
  `id_mark` tinyint(1) UNSIGNED NOT NULL,
  `id_amount` tinyint(1) UNSIGNED NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Each time new data arrive, a profile is created and if necessary a new mark and a new amount.

  • A profile have a fixed number of mark (9, 16, 24)
  • For each profile, each of its mark can have 0, 1 or 2 amount

It means a profile can have up to 48 values. I plan to have at least 20000 profile in database in the future.

My Goal: If I take one profile, I must find all other profiles which have at least 1 common value for X of their marks. (where X is the number of marks minimum which must match)

At the moment, I take all profiles one by one to compare it to the tested one. It takes a some time (I have only about 50 profile at the moment in database) and it's not a good solution for the future of my application.

Another solution I imagine is to cache (or save in database), all profile id for each mark_amount association... but It seems not a good idea :(

I need some advise to optimize this comparison please. (I'm open to other database, cache system than php/mysql etc...)

EDIT1: example of profile matching or not on 8 marks

https://jsfiddle.net/gafy2w4k/


Solution

  • The query to return all profile_mark.id_profile that have exactly @matched_marks marks with at least 1 same amount as the profile with the given @target_profile_id:

    SELECT `match`.id_profile, count(*) as X FROM (
        SELECT DISTINCT `all`.id_profile, `all`.id_mark FROM profile_mark as `all`
        INNER JOIN profile_mark as `one` 
          ON `one`.id_mark = `all`.id_mark 
          AND `one`.id_amount = `all`.id_amount
        WHERE `all`.id_profile <> @target_profile_id
          AND `one`.id_profile = @target_profile_id
    ) as `match`
    GROUP BY 1
    HAVING X = @matched_marks; // can be >= if you need at least X matching marks
    

    As a side note, id_profile smallint(2) seems insufficient for at least 20000 profiles.