I am attempting to join four large tables (35-200 million rows) of the international patent database (PATSTAT) into a top 15 of most cited patents matching a few requirements.
A first table (t9
) lists citations from one group (family) of applications to another.
Another table (t1
) basically links everything together as is contains both family and application id's, and filing years
Tables t2
and tls209_appln_ipc
have are used to identify the appln_id
's to include.
The code that I have finally come to is the following:
SELECT t9.cited_docdb_family_id, COUNT(t9.cited_docdb_family_id) AS cited, t3.appln_id
FROM docdb_family_citation t9
LEFT JOIN
(SELECT
t1.appln_id, t1.docdb_family_id from tls201_appln t1
LEFT JOIN tls204_appln_prior t2 on t1.appln_id=t2.appln_id
WHERE
t1.appln_filing_year BETWEEN 2010 AND 2015
AND
t2.appln_id IS NULL
AND
t1.appln_id IN (SELECT distinct appln_id from tls209_appln_ipc where ipc_subclass_symbol in ("A61K", "C07K", "A61P", "Cl2N", "C07D", "Cl2P", "C07H", "C12Q", "C07J"))) t3 ON t9.cited_docdb_family_id=t3.docdb_family_id
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15
The problem is that the query ran in the online web-based interface of PATSTAT does not converge before my session times out. Are there ways to improve the efficiency of this query?
-Edit-
The tls_209_appln_ipc
contains 195 million rows of appln_id
coupled with ipc_subclass_symbol
. appln_id
's may occur zero or more times in this table. In my query I only need the docdb_family_id
s if any of their linked appln_id
s is linked to any of the ipc_subclass_symbol
s I listed.
With help from the earlier answers, final code that gave the result I was looking for:
SELECT t9.cited_docdb_family_id, t99.cited AS cited, t1.appln_id, t1.appln_nr_epodoc
FROM docdb_family_citation t9
INNER JOIN (SELECT cited_docdb_family_id, count(cited_docdb_family_id) as cited FROM docdb_family_citation GROUP BY cited_docdb_family_id) t99
ON t9.cited_docdb_family_id = t99.cited_docdb_family_id
LEFT JOIN
tls201_appln t1
on t9.cited_docdb_family_id = t1.docdb_family_id
WHERE
t1.appln_filing_year BETWEEN 2010 AND 2015 and
exists (select 1 from tls209_appln_ipc t209
where t209.appln_id = t1.appln_id
and t209.ipc_subclass_symbol in ("A61K", "C07K", "A61P", "Cl2N", "C07D", "Cl2P", "C07H", "C12Q", "C07J")
) and
not exists (select 1 from tls204_appln_prior t2
where t1.appln_id = t2.appln_id
)
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15;`
note that the join with subquery t99
is used to get the correct cited
count