Search code examples
mysqlsqlwikimediawikimedia-dumps

Select rows based on information stored in separate table


First of all I'm sorry for the overly vague title, however I'm unfamiliar with the proper terminology for a problem like this.

I'm attempting to retrieve a list of page titles from Wiktionary (Wikimedia wiki-based dictionary) where the page must be categorized under English_lemmas, but must not be categorized under English_phrases, English_slang, and English_%_forms (where % is the wildcard).

The two necessary tables are page which contains page information (page_id, page_title), and categorylinks which contains the categories that each page is categorized under.

The relevant structure of page is like so:

+---------+------------------+
| page_id | page_title       |
+---------+------------------+
| 0       | race             |
| 1       | racing           |
| 2       | that's all folks |
| 3       | fire             |
| 4       | psychédélique    |
+---------+------------------+

and the relevant structure of categorylinks is like so: (I've added line breaks for easier reading)

+---------+-------------------------+
| cl_from | cl_to                   |
+---------+-------------------------+
| 0       | English_lemmas          |
| 0       | English_verbs           |

| 1       | English_lemmas          |
| 1       | English_verbs           |
| 1       | English_non-lemma_forms |

| 2       | English_lemmas          |
| 2       | English_phrases         |

| 3       | English_lemmas          |
| 3       | English_nouns           |

| 4       | French_lemmas           |
| 4       | French_adjectives       |
+---------+-------------------------+

where categorylinks.cl_from is a direct reference to page.page_id and categorylinks.cl_to is the title of the category.

I need to select race and fire as they are both categorized under English_lemmas, but not racing or that's all folks as in addition to both being categorized under English_lemmas they are also categorized under English_%_forms and English_phrases respectively, and not psychédélique as it is not categorized under English_lemmas.

Therefore the ideal result would be

+---------+------------------+
| page_id | page_title       |
+---------+------------------+
| 0       | race             |
| 3       | fire             |
+---------+------------------+

Is this something that is feasible to achieve efficiently? And if so how can I achieve this?

I have a fairly basic understanding of SQL (basic SELECT, UPDATE, etc statements), so something like this is far beyond my comprehension.


Solution

  • If I understood your requirements correctly:

    select *
      from page
      where page_id not in (select cl_from from categorylinks
                             where cl_to like 'English_%_forms'
                                or cl_to like 'English_phrases')
        and page_id in (select cl_from from categorylinks
                          where cl_to like 'English_lemmas')