Search code examples
postgresqlwikipedia

How can I import wikipedia english page titles into a postgresql table?


I downloaded the titles from: https://dumps.wikimedia.org/enwiki/20240420/enwiki-20240420-all-titles-in-ns0.gz

And extracted they are a newline delimited text file. In postgresql I created a table to store the data:

create table titles (
  title varchar(255) primary key
);

And I tried to import it using text or csv format:

thisdb=# COPY titles FROM '/some/path/enwiki-20240401-all-titles-in-ns0' with (format 'text');
ERROR:  duplicate key value violates unique constraint "titles_pkey"
DETAIL:  Key (title)=(//Xekwi_language) already exists.
CONTEXT:  COPY titles, line 23830

But when I did a less search of the file only that one line had that record, which is odd. Maybe this was an escaping issue? Somehow it is seeing the lines:

//Xek\wi_language
//Xekwi_language

As the same entry. When I insert them individually there is no collision, but when I do the format text copy it sees it as a collision... why?

or csv:

thisdb=# COPY titles FROM '/some/path/enwiki-20240401-all-titles-in-ns0' with (format 'csv');
ERROR:  extra data after last expected column
CONTEXT:  COPY titles, line 45: "!Alfaro_Vive,_Carajo!"

Does anyone know how to do this?


Solution

  • Titles are 255 characters max so the field size is not an issue. Somehow this ingestion issue is being caused by escaping that postgres is running on the inputs. Looking at the definition of titles, we see that # < and > are unused characters in titles.

    I was able to get this working with csv ingestion and using the unused characters as delimiters, escaping, and quote characters:

    COPY titles FROM '/some/path/enwiki-20240401-all-titles-in-ns0' with (
        FORMAT 'csv',
        HEADER TRUE,
        DELIMITER '<',
        ESCAPE '>',
        QUOTE '#'
    );