I am not familiar enough with MySQL execution plan so I need help to understand and find out how to operate on a subsets of data in MySQL if it is possible. I have two tables:
Table users:
+-----------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+----------------+
| user_id | int(11) | NO | PRI | NULL | auto_increment |
| msisdn | bigint(20) | NO | UNI | NULL | |
| activation_date | datetime | NO | | NULL | |
| msisdn_type | varchar(32) | NO | | NULL | |
+-----------------+-------------+------+-----+---------+----------------+
Table log_archive:
+-------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| msisdn | bigint(11) | NO | MUL | NULL | |
| msisdn_type | varchar(32) | NO | | NULL | |
| date | date | NO | | NULL | |
| action | varchar(32) | NO | | NULL | |
+-------------+--------------+------+-----+---------+-------+
In table users msisdn is unique but in log_archive it is not.
Here you can find PHP script that will generate you test data for these two tables:
Test data generation script helper
I need to select:
1) All distinct records by msisdn from table log_archive;
2) By earliest date per msisdn for one specific action only;
3) For a specific date range from table log_archive;
4) And to join activation_date from users table with msisdn from both tables.
Let me give you example. Let's say that this is sample data from log_archive table:
+--------------+------------+---------------------+----------------+
| msisdn | date | activation_date | action |
|--------------+------------+---------------------+----------------+
| 977129764170 | 2016-02-11 | 2014-10-07 00:00:00 | all_services |
| 977129764170 | 2015-09-05 | 2014-10-07 00:00:00 | app_start |
| 977129764170 | 2015-05-08 | 2014-10-07 00:00:00 | widget |
| 986629508626 | 2015-07-12 | 2016-02-05 00:00:00 | app_start |
| 986629508626 | 2015-03-02 | 2016-02-05 00:00:00 | number_connect |
| 986629508626 | 2015-05-08 | 2016-02-05 00:00:00 | widget |
| 986629508626 | 2015-01-08 | 2016-02-05 00:00:00 | app_start |
| 933563888440 | 2016-02-20 | 2014-10-06 00:00:00 | all_services |
| 933563888440 | 2015-03-12 | 2014-10-06 00:00:00 | app_start |
| 933563888440 | 2015-04-26 | 2014-10-06 00:00:00 | number_connect |
| 933563888440 | 2015-10-17 | 2014-10-06 00:00:00 | all_services |
| 943730853721 | 2015-06-19 | 2015-05-01 00:00:00 | widget |
| 943730853721 | 2015-12-08 | 2015-05-01 00:00:00 | app_start |
| 943730853721 | 2016-02-09 | 2015-05-01 00:00:00 | app_start |
+--------------+------------+---------------------+----------------+
So distinct msisdns here are 977129764170, 986629508626, 933563888440, 943730853721;
Earliest date for distinct msisdn value where action column equals 'app_start' is:
977129764170 is 2015-09-05
986629508626 is 2015-01-08
933563888440 is 2015-03-12
943730853721 is 2015-06-19
I need to make such SQL that will give me this output:
+--------------+------------+---------------------+----------------+
| msisdn | date | activation_date | action |
|--------------+------------+---------------------+----------------+
| 977129764170 | 2015-09-05 | 2014-10-07 00:00:00 | app_start |
| 986629508626 | 2015-01-08 | 2016-02-05 00:00:00 | app_start |
| 933563888440 | 2015-03-12 | 2014-10-06 00:00:00 | app_start |
| 943730853721 | 2015-12-08 | 2015-05-01 00:00:00 | app_start |
+--------------+------------+---------------------+----------------+
So I need to select all distinct msisdns for earliest date where app_start action occurs and join activation_date from users table by that distinct msisd. And to only look for specific date range from date column.
I tried it with this sql with no results:
SELECT DISTINCT(log_archive.msisdn) as msisdn, DATE(log_archive.date) AS actionDate, users.activation_date
FROM log_archive
INNER JOIN users on log_archive.msisdn = users.msisdn
WHERE log_archive.action = 'app_start' && log_archive.date BETWEEN '2015-01-08' AND '2016-03-15'
ORDER BY actionDate ASC;
I get the same msisdn more than once even though I used DISTINCT.
Do I need to use subquery?
You'll want a GROUP BY
to get MIN(date)
for each msisdn;
SELECT msisdn, MIN(date) date, MIN(action) action
FROM log_archive
WHERE action='app_start'
AND date BETWEEN '2015-01-08' AND '2016-03-15'
GROUP BY msisdn
We also add a MIN(action)
since we should aggregate each field that isn't grouped by and since action is the same for all selected rows, MIN
works well.
Once you have that, adding the join is pretty straight forward;
SELECT a.msisdn, MIN(a.date) date, u.activation_date, MIN(a.action) action
FROM log_archive a
JOIN users u
ON u.msisdn = a.msisdn
WHERE a.action='app_start'
AND a.date BETWEEN '2015-01-08' AND '2016-03-15'
GROUP BY a.msisdn