Tags: web-scraping, dump, freebase, data-collection

Getting the list of ALL topic names from Freebase


According to Freebase, there are 23,407,174 topics. What is the easiest way to get the UI-friendly names (essentially the 'text' attribute of the topic JSON; an example of a single topic's JSON is here) of ALL of these topics? I don't need any other meta information.


Solution

  • wget -O - http://download.freebase.com/datadumps/latest/freebase-simple-topic-dump.tsv.bz2 | bunzip2 | cut -f 2 > freebase-topic-names.txt
    

    You probably want the Freebase IDs as well, though, so that you know what the names refer to:

    wget -O - http://download.freebase.com/datadumps/latest/freebase-simple-topic-dump.tsv.bz2 | bunzip2 | cut -f 1,2
    

    Two additional bits of postprocessing are needed:

    1. Tabs within a name are escaped as \t and need to be unescaped
    2. The string \N represents a null (non-existent) name, so those rows should be skipped
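
The two postprocessing steps could be handled with a small awk filter, roughly as sketched below. This is only a sketch under a few assumptions: the input is the two-column (ID, name) output of the `cut -f 1,2` pipeline, the function name `clean_topic_names` is made up for illustration, and an escaped tab is turned into a space rather than a real tab so that the output stays valid TSV. It also assumes a name never contains an escaped backslash immediately followed by `t`; how the dump escapes backslashes is not covered here.

```shell
# Hypothetical helper: reads "ID<TAB>name" lines on stdin and writes cleaned lines to stdout.
clean_topic_names() {
  awk -F'\t' -v OFS='\t' '
    # Skip topics whose name is the null marker \N.
    $2 == "\\N" { next }
    {
      # Replace the literal two-character sequence \t with a space
      # (assumption: a space keeps the two-column layout intact).
      gsub(/\\t/, " ", $2)
      print $1, $2
    }
  '
}
```

It would slot onto the end of the pipeline, for example `... | bunzip2 | cut -f 1,2 | clean_topic_names > freebase-topics.tsv` (the output filename is arbitrary).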