Search code examples
pythonsqlmysqldistinctdistinct-values

Select one distinct column, others can be non distinct. - MySQL


Ok I have a temporary table with 135000 rows, I'm trying to insert some of the values from this temporary table into other tables.

This is the schema I'm using

tvtemptable

+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| PROGTITLE   | text        | YES  |     | NULL    |       |
| SUBTITLE    | text        | YES  |     | NULL    |       |
| EPISODE     | text        | YES  |     | NULL    |       |
| YR          | year(4)     | YES  |     | NULL    |       |
| DIRECTOR    | text        | YES  |     | NULL    |       |
| PERFORMERS  | text        | YES  |     | NULL    |       |
| PREMIERE    | tinyint(1)  | YES  |     | NULL    |       |
| FILM        | tinyint(1)  | YES  |     | NULL    |       |
| RPEAT       | tinyint(1)  | YES  |     | NULL    |       |
| SUBTITLES   | tinyint(1)  | YES  |     | NULL    |       |
| WIDESCREEN  | tinyint(1)  | YES  |     | NULL    |       |
| NEWSERIES   | tinyint(1)  | YES  |     | NULL    |       |
| DEAFSIGNED  | tinyint(1)  | YES  |     | NULL    |       |
| BNW         | tinyint(1)  | YES  |     | NULL    |       |
| STARRATING  | tinyint(4)  | YES  |     | NULL    |       |
| CERTIFICATE | varchar(5)  | YES  |     | NULL    |       |
| GENRE       | varchar(50) | YES  |     | NULL    |       |
| DESCRIPTION | text        | YES  |     | NULL    |       |
| CHOICE      | tinyint(1)  | YES  |     | NULL    |       |
| PROGDATE    | date        | YES  |     | NULL    |       |
| STARTIME    | time        | YES  |     | NULL    |       |
| ENDTIME     | time        | YES  |     | NULL    |       |
| DURATION    | int(11)     | YES  |     | NULL    |       |
| CHANNELID   | int(11)     | NO   |     | NULL    |       |
+-------------+-------------+------+-----+---------+-------+

channels

+-------------+-------------+------+-----+---------+----------------+
| Field       | Type        | Null | Key | Default | Extra          |
+-------------+-------------+------+-----+---------+----------------+
| CHANNELID   | int(11)     | NO   | PRI | NULL    | auto_increment |
| CHANNELNAME | varchar(50) | YES  |     | NULL    |                |
+-------------+-------------+------+-----+---------+----------------+

genre

+-----------+-------------+------+-----+---------+----------------+
| Field     | Type        | Null | Key | Default | Extra          |
+-----------+-------------+------+-----+---------+----------------+
| GENREID   | int(11)     | NO   | PRI | NULL    | auto_increment |
| GENRENAME | varchar(50) | YES  |     | NULL    |                |
+-----------+-------------+------+-----+---------+----------------+

programme

+-------------+------------+------+-----+---------+----------------+
| Field       | Type       | Null | Key | Default | Extra          |
+-------------+------------+------+-----+---------+----------------+
| PROGRAMMEID | int(11)    | NO   | PRI | NULL    | auto_increment |
| GENREID     | int(11)    | NO   | MUL | NULL    |                |
| PROGTITLE   | text       | YES  |     | NULL    |                |
| YR          | year(4)    | YES  |     | NULL    |                |
| DIRECTOR    | text       | YES  |     | NULL    |                |
| PERFORMERS  | text       | YES  |     | NULL    |                |
| FILM        | tinyint(1) | YES  |     | NULL    |                |
| WIDESCREEN  | tinyint(1) | YES  |     | NULL    |                |
| BNW         | tinyint(1) | YES  |     | NULL    |                |
| CERTIFICATE | varchar(5) | YES  |     | NULL    |                |
| DESCRIPTION | text       | YES  |     | NULL    |                |
+-------------+------------+------+-----+---------+----------------+

episode

+-------------+---------+------+-----+---------+----------------+
| Field       | Type    | Null | Key | Default | Extra          |
+-------------+---------+------+-----+---------+----------------+
| EPISODEID   | int(11) | NO   | PRI | NULL    | auto_increment |
| PROGRAMMEID | int(11) | NO   | MUL | NULL    |                |
| SUBTITLE    | text    | YES  |     | NULL    |                |
| EPISODE     | text    | YES  |     | NULL    |                |
| DIRECTOR    | text    | YES  |     | NULL    |                |
| PERFORMERS  | text    | YES  |     | NULL    |                |
| DESCRIPTION | text    | YES  |     | NULL    |                |
+-------------+---------+------+-----+---------+----------------+

channelprogramme

+--------------------+------------+------+-----+---------+----------------+
| Field              | Type       | Null | Key | Default | Extra          |
+--------------------+------------+------+-----+---------+----------------+
| CHANNELPROGRAMMEID | int(11)    | NO   | PRI | NULL    | auto_increment |
| CHANNELID          | int(11)    | NO   | MUL | NULL    |                |
| PROGRAMMEID        | int(11)    | NO   | MUL | NULL    |                |
| EPISODEID          | int(11)    | NO   | MUL | NULL    |                |
| RPEAT              | tinyint(1) | YES  |     | NULL    |                |
| NEWSERIES          | tinyint(1) | YES  |     | NULL    |                |
| PREMIERE           | tinyint(1) | YES  |     | NULL    |                |
| CHOICE             | tinyint(1) | YES  |     | NULL    |                |
| SUBTITLES          | tinyint(1) | YES  |     | NULL    |                |
| DEAFSIGNED         | tinyint(1) | YES  |     | NULL    |                |
| STARRATING         | tinyint(4) | YES  |     | NULL    |                |
| PROGDATE           | date       | YES  |     | NULL    |                |
| STARTTIME          | time       | YES  |     | NULL    |                |
| ENDTIME            | time       | YES  |     | NULL    |                |
| DURATION           | tinyint(4) | YES  |     | NULL    |                |
+--------------------+------------+------+-----+---------+----------------+

So a little bit of background, the database is a TV listings database, the channelprogramme table contains all the entries for every show on every channel, the programme table contains unique entries for every program, episode contains unique entries for every episode of programs that have episodes

I think my confusion is I want to select the distinct progtitles from the temporary table to populate the programme table, but I want some other information to go with it.

For example if I did

select distinct(progtitle) from tvtemptable;

It would only return me the value for the progtitle column, whereas what I actually want is the progtitle plus some other stuff.

So if I try to select more information like so

mysql> select distinct progtitle, yr, director, film from tvtemptable
limit 30;
+-----------------------------------+------+---------------------+------+
| progtitle                         | yr   | director            | film |
+-----------------------------------+------+---------------------+------+
| Teleshopping                      | 2000 |                     |    0 |
| Cinemania                         | 2000 |                     |    0 |
| Whose Line Is It Anyway?          | 2000 |                     |    0 |
| Just Desserts                     | 2004 | Kevin Connor        |    1 |
| Law & Order                       | 2000 | Matthew Penn        |    0 |
| Jane Doe: Yes, I Remember it Well | 2006 | Armand Mastroianni  |    0 |
| CSI: NY                           | 2000 | David Jackson       |    0 |
| CSI: Crime Scene Investigation    | 2000 | Kenneth Fink        |    0 |
| NCIS                              | 2000 | Colin Bucksey       |    0 |
| CSI: Miami                        | 2000 |                     |    0 |
| Enter the Dragon                  | 1973 | Robert Clouse       |    1 |
| Close                             | 2000 |                     |    0 |
| My Son Is Innocent                | 1996 | Larry Elikann       |    1 |
| Law & Order                       | 2000 | Christopher Misiano |    0 |
| Murder 101                        | 2006 | Christian I Nyby II |    1 |
| CSI: NY                           | 2000 | Christine Moore     |    0 |
| CSI: Crime Scene Investigation    | 2000 | Bill Eagles         |    0 |
| Rush Hour                         | 1998 | Brett Ratner        |    1 |
| Dark Blue                         | 2000 | Jeffrey Hunt        |    0 |
| CSI: Crime Scene Investigation    | 2000 | Richard J Lewis     |    0 |
| Ordinary Miracles                 | 2005 | Michael Switzer     |    1 |
| Law & Order                       | 2000 | Jace Alexander      |    0 |
| Wounded Heart                     | 1995 | Vic Sarin           |    1 |
| CSI: NY                           | 2000 | Jonathan Glassner   |    0 |
| Dark Blue                         | 2000 | Nathan Hope         |    0 |
| Blade: The Series                 | 2000 | Michael Robison     |    0 |
| K-Ville                           | 2000 | Kevin Dowling       |    0 |
| Law & Order                       | 2000 | Jim Ellis           |    0 |
| Reasons of the Heart              | 1996 | Rick Jacobson       |    1 |
| CSI: NY                           | 2000 | Anthony Hemingway   |    0 |
+-----------------------------------+------+---------------------+------+

There will be a few duplicates in the progtitle column. The programmes aren't repeated for each new director, there are different directors above for some of the programmes because those programmes have episodes.

Now I've managed to get the select to work properly If I only need one extra column for example

select distinct
    t1.progtitle,
   (select
        t2.director
    from
        tvtemptable t2
    where 
        t1.progtitle = t2.progtitle
    limit 1) as "Director"
from
    tvtemptable t1 limit 10;

+-----------------------------------+--------------------+
| progtitle                         | Director           |
+-----------------------------------+--------------------+
| Teleshopping                      |                    |
| Cinemania                         |                    |
| Whose Line Is It Anyway?          |                    |
| Just Desserts                     | Kevin Connor       |
| Law & Order                       | Matthew Penn       |
| Jane Doe: Yes, I Remember it Well | Armand Mastroianni |
| CSI: NY                           | David Jackson      |
| CSI: Crime Scene Investigation    | Kenneth Fink       |
| NCIS                              | Colin Bucksey      |
| CSI: Miami                        |                    |
+-----------------------------------+--------------------+

Obviously this will get messy If I want to select more than one additional column

So what's the best way to do this?

This is my insert command for populating the programme table from the temporary table, currently it has the same issue with duplicates in the select example I gave above.

INSERT INTO PROGRAMME (
    PROGTITLE, GENREID, YR, DIRECTOR,
    PERFORMERS, FILM, WIDESCREEN, BNW,
    CERTIFICATE, DESCRIPTION)
SELECT DISTINCT
    T.PROGTITLE, G.GENREID, T.YR, T.DIRECTOR,
    T.PERFORMERS, T.FILM, T.WIDESCREEN, T.BNW,
    T.CERTIFICATE, T.DESCRIPTION
FROM
    TVTEMPTABLE T
    INNER JOIN GENRE G ON G.GENRENAME=T.GENRE
    LEFT JOIN PROGRAMME P ON P.PROGTITLE=T.PROGTITLE
WHERE
    P.PROGTITLE IS NULL

Solution

  • Am I right to assume that your TVTEMPTABLE is the result of a join of tables from someplace?

    If that is the case, I think you will have an easier job overall, if you first split that temp table into multiple temp tables that have the same format as your target tables.

    For example (and I don't know your data so I just guess):

    -- All genres (matches your existing genres table)
    create table genres_temp as
       select distinct genre
         from tvtemptable;
    
    -- All programmes (matches your existing programme table)
    create table programmes_temp as
       select distinct all_the_programme_columns
         from tvtemptable;
    
    -- Contains the many-to-many relationship between genres and programmes
    create table programme_genre_temp as 
       select distinct genre, progtitle
         from tvtemptable;
    

    The above doesn't answer your question, but it may give you some ideas that takes away the problem, which seems to be that programme information is repeated for each something. You need to find out what it is.

    Updated So each programme is repeated once for each director? Then my above approach makes even more sense: Extract the entities, and relationships first.

    Updated 2 Hmm, you force-fit director into programme, when director is related to the episode? That would be a modelling error it seems. Anyway, if you don't care which director you pick, you can group on progtitle and use MAX() around all other columns. That will give you distinct progtitles and one of the values for the rest of the columns.

    SELECT T.PROGTITLE, max(G.GENREID), max(T.YR), max(T.DIRECTOR), etc
      FROM TVTEMPTABLE T
     INNER JOIN GENRE G ON G.GENRENAME=T.GENRE
           LEFT JOIN PROGRAMME P ON P.PROGTITLE=T.PROGTITLE
     WHERE P.PROGTITLE IS NULL
     group by T.PROGTITLE;