javascript freebase mql

Freebase: how to extract detailed information for all companies?


I want to extract detailed information for all companies in Freebase. I tried to do that using MQL queries, but they never return more than 4100 records. I have also tried using cursors, but with cursors I still get the same number of records.

I have googled it, and some people suggest downloading the dump and then extracting the information from it. Is that the only way? If so, how do I get the following information from the dump? Any help is highly appreciated.

[
  {
    "type": "/business/company",
    "name": null,
    "parent_company": [{}],
    "products": [],
    "industry": [],
    "founded": null,
    "net_income": [
      {
        "amount": null,
        "valid_date": null,
        "currency": null
      }
    ],
    "company_type": [],
    "headquarters": [{}],
    "number_of_employees": [{}],

    "/base/schemastaging/organization_extra/phone_number": [{}]
  }
]

Solution

  • First, the obligatory warning. Freebase has been read-only for many months and will soon be shut down. The data there is stale.

    I get a count of 4189 for that query, so it sounds like you're pretty close to the expected results. On the other hand, there are over 400K businesses in Freebase, so perhaps you don't really intend to limit your query to only those which have net income information. If that's the case, you can modify your query by adding "optional": true to that clause of the query, i.e.:

      "net_income": [{
        "amount": null,
        "valid_date": null,
        "currency": null,
        "optional": true
      }],
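
If the query is built programmatically, the same change can be applied in code. The following is only a sketch: `make_optional` is a hypothetical helper, not part of any Freebase client library, and it assumes the query is held as a plain Python dict before being serialized to JSON.

```python
import json

def make_optional(query, keys):
    """Return a copy of an MQL query with "optional": true added
    to the named object-valued clauses (hypothetical helper)."""
    result = dict(query)
    for key in keys:
        clause = result.get(key)
        # Only clauses of the form [{...}] are touched; null and [] are left alone.
        if isinstance(clause, list) and clause and isinstance(clause[0], dict):
            result[key] = [dict(clause[0], optional=True)]
    return result

# A shortened version of the query from the question.
query = {
    "type": "/business/company",
    "name": None,
    "net_income": [{"amount": None, "valid_date": None, "currency": None}],
}

relaxed = make_optional(query, ["net_income"])
# MQL expects the query wrapped in a list when many results are wanted.
print(json.dumps([relaxed], indent=2))
```

The original dict is left unchanged, so the strict and relaxed versions of the query can be compared side by side.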
    

    Having said that, 400K is an awful lot to query through the API. To get the same information from the Freebase data dump, just filter for the same properties you've included in your query.

    Note that there's been some significant refactoring of this schema over the years, so some of the things in your query aren't the currently preferred property names, but rather older aliases. For example, the current name for /business/company is /business/business_operation and /business/company/founded is really just an alias for /organization/organization/date_founded, so that's what you'd want to look for in the dump.

    In the dump, all slashes (/) are replaced with dots (.), so you can filter using zgrep commands like these:

    $ zgrep "organization\.organization\.parent" freebase-rdf-2015-04-19-00-00.gz
    <http://rdf.freebase.com/ns/m.010b0njl> <http://rdf.freebase.com/ns/organization.organization.parent>   <http://rdf.freebase.com/ns/m.010d_x4z> .
    <http://rdf.freebase.com/ns/m.010qw9c3> <http://rdf.freebase.com/ns/organization.organization.parent>   <http://rdf.freebase.com/ns/m.0110pjfc> .
    
    $ zgrep "business\.business_operation\.industry" freebase-rdf-2015-04-19-00-00.gz
    <http://rdf.freebase.com/ns/m.010b2kgs> <http://rdf.freebase.com/ns/business.business_operation.industry>   <http://rdf.freebase.com/ns/m.0c5mq>    .
    <http://rdf.freebase.com/ns/m.010h6tq9> <http://rdf.freebase.com/ns/business.business_operation.industry>   <http://rdf.freebase.com/ns/m.02y_9m3>  .
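
If you want several properties in a single pass over the dump, the same filtering can be sketched in Python. This is a minimal sketch, not a definitive implementation: `predicate_uri` and `filter_triples` are hypothetical names, and it assumes each dump line is a whitespace-separated subject, predicate and object followed by a terminating dot, as in the output above.

```python
import gzip

NS = "http://rdf.freebase.com/ns/"

def predicate_uri(property_id):
    """Map a Freebase property ID such as /a/b/c to its dump form <NS + a.b.c>."""
    return "<" + NS + property_id.strip("/").replace("/", ".") + ">"

def filter_triples(lines, property_ids):
    """Yield (subject, predicate, object) for lines whose predicate is wanted."""
    wanted = {predicate_uri(p) for p in property_ids}
    for line in lines:
        parts = line.split(None, 2)  # subject and predicate contain no whitespace
        if len(parts) < 3 or parts[1] not in wanted:
            continue
        obj = parts[2].rstrip()
        if obj.endswith("."):        # drop the statement terminator
            obj = obj[:-1].rstrip()
        yield parts[0], parts[1], obj

# Two lines copied from the zgrep output above.
sample = [
    "<http://rdf.freebase.com/ns/m.010b0njl>\t<http://rdf.freebase.com/ns/organization.organization.parent>\t<http://rdf.freebase.com/ns/m.010d_x4z>\t.\n",
    "<http://rdf.freebase.com/ns/m.010b2kgs>\t<http://rdf.freebase.com/ns/business.business_operation.industry>\t<http://rdf.freebase.com/ns/m.0c5mq>\t.\n",
]
for triple in filter_triples(sample, ["/business/business_operation/industry"]):
    print(triple)

# Against the real dump, stream the gzip file instead of loading it into memory:
# with gzip.open("freebase-rdf-2015-04-19-00-00.gz", "rt", encoding="utf-8") as dump:
#     for triple in filter_triples(dump, ["/organization/organization/parent"]):
#         print(triple)
```

Exact matching on the predicate avoids the accidental substring matches that a loose grep pattern can produce.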
    

    For mediators or CVTs, there will be a separate line for each piece of the mediator. So, for example, a name change might look like this:

    <http://rdf.freebase.com/ns/m.0q2g4kt>  <http://rdf.freebase.com/ns/business.company_name_change.end_date>  "2004"^^<http://www.w3.org/2001/XMLSchema#gYear>    .
    <http://rdf.freebase.com/ns/m.0q2g4kt>  <http://rdf.freebase.com/ns/business.company_name_change.company>   <http://rdf.freebase.com/ns/m.06_dbm>   .
    <http://rdf.freebase.com/ns/m.0q2g4kt>  <http://rdf.freebase.com/ns/business.company_name_change.start_date>    "1974"^^<http://www.w3.org/2001/XMLSchema#gYear>    .
    <http://rdf.freebase.com/ns/m.0q2g4kt>  <http://rdf.freebase.com/ns/business.company_name_change.new_name>  "Cinar"@en  .
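
To put a mediator back together, the lines that share a subject can be grouped into one record. The sketch below is illustrative only: `group_by_subject` is a hypothetical helper, it assumes the same whitespace-separated line layout as shown above, and it holds everything in memory, so it is meant for lines you have already filtered out of the dump rather than the whole file.

```python
from collections import defaultdict

def group_by_subject(lines):
    """Collect triples into {subject: {field: [values]}} (hypothetical helper)."""
    records = defaultdict(dict)
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        subject, predicate, rest = parts
        value = rest.rstrip()
        if value.endswith("."):      # drop the statement terminator
            value = value[:-1].rstrip()
        # Use the last segment of the predicate as the field name, e.g. "new_name".
        field = predicate.strip("<>").rsplit(".", 1)[-1]
        records[subject].setdefault(field, []).append(value)
    return dict(records)

# The four name-change lines shown above.
sample = [
    '<http://rdf.freebase.com/ns/m.0q2g4kt>\t<http://rdf.freebase.com/ns/business.company_name_change.end_date>\t"2004"^^<http://www.w3.org/2001/XMLSchema#gYear>\t.',
    '<http://rdf.freebase.com/ns/m.0q2g4kt>\t<http://rdf.freebase.com/ns/business.company_name_change.company>\t<http://rdf.freebase.com/ns/m.06_dbm>\t.',
    '<http://rdf.freebase.com/ns/m.0q2g4kt>\t<http://rdf.freebase.com/ns/business.company_name_change.start_date>\t"1974"^^<http://www.w3.org/2001/XMLSchema#gYear>\t.',
    '<http://rdf.freebase.com/ns/m.0q2g4kt>\t<http://rdf.freebase.com/ns/business.company_name_change.new_name>\t"Cinar"@en\t.',
]
for subject, fields in group_by_subject(sample).items():
    print(subject, fields)
```

Values are kept as lists because a property can appear more than once for the same subject.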