Search code examples
mysqlgroup-bysubqueryleft-joinlarge-data

How to optimize MySQL queries with Joins and subqueries for large datasets (millions of rows)


I am attempting to join four large tables (35-200 million rows) of the international patent database (PATSTAT) into a top 15 of most cited patents matching a few requirements.

A first table (t9) lists citations from one group (family) of applications to another. Another table (t1) basically links everything together as is contains both family and application id's, and filing years Tables t2 and tls209_appln_ipc have are used to identify the appln_id's to include.

The code that I have finally come to is the following:

SELECT t9.cited_docdb_family_id, COUNT(t9.cited_docdb_family_id) AS cited, t3.appln_id
FROM docdb_family_citation t9 
LEFT JOIN 
(SELECT
t1.appln_id, t1.docdb_family_id from tls201_appln t1
LEFT JOIN tls204_appln_prior t2 on t1.appln_id=t2.appln_id 
WHERE
t1.appln_filing_year BETWEEN 2010 AND 2015
AND
t2.appln_id IS NULL
AND
t1.appln_id IN (SELECT distinct appln_id from tls209_appln_ipc where ipc_subclass_symbol in ("A61K", "C07K", "A61P", "Cl2N", "C07D", "Cl2P", "C07H", "C12Q", "C07J"))) t3 ON t9.cited_docdb_family_id=t3.docdb_family_id
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15

The problem is that the query ran in the online web-based interface of PATSTAT does not converge before my session times out. Are there ways to improve the efficiency of this query?

-Edit-
The tls_209_appln_ipc contains 195 million rows of appln_id coupled with ipc_subclass_symbol. appln_id's may occur zero or more times in this table. In my query I only need the docdb_family_ids if any of their linked appln_ids is linked to any of the ipc_subclass_symbols I listed.


Solution

  • With help from the earlier answers, final code that gave the result I was looking for:

    SELECT t9.cited_docdb_family_id, t99.cited AS cited, t1.appln_id, t1.appln_nr_epodoc
            FROM docdb_family_citation t9 
    INNER JOIN (SELECT cited_docdb_family_id, count(cited_docdb_family_id) as cited FROM docdb_family_citation GROUP BY cited_docdb_family_id) t99 
    ON t9.cited_docdb_family_id = t99.cited_docdb_family_id
    LEFT JOIN 
         tls201_appln t1
         on t9.cited_docdb_family_id = t1.docdb_family_id 
         WHERE
            t1.appln_filing_year BETWEEN 2010 AND 2015 and
            exists (select 1 from tls209_appln_ipc t209
                    where t209.appln_id = t1.appln_id
                      and    t209.ipc_subclass_symbol in ("A61K", "C07K", "A61P", "Cl2N", "C07D", "Cl2P", "C07H", "C12Q", "C07J")
                   ) and
            not exists (select 1 from tls204_appln_prior t2
                        where t1.appln_id = t2.appln_id 
                       )
    GROUP BY t9.cited_docdb_family_id
    ORDER BY cited DESC
    LIMIT 15;`
    

    note that the join with subquery t99 is used to get the correct cited count