
How to reverse engineer Google's entity IDs


Google is using entities everywhere nowadays, and their IDs are usually prefixed with /m/ or /g/ (though I have also seen some /t/ lately).

I am wondering how the numbering works. For /m/ the scheme is similar to what a URL shortener does: define an alphabet (for /m/ this is the 32 characters "0123456789bcdfghjklmnpqrstvwxyz_") and convert a number to a "short URL" by writing it in that base.

e.g. /m/0 4swd <-> 156524 ("/m/0" seems to be a kind of a prefix)
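
A minimal sketch of that conversion in Python, assuming the part after "/m/0" is a plain big-endian base-32 number over the alphabet quoted above. The names `decode_mid` and `encode_mid` are made up for illustration, and treating "/m/0" as a fixed prefix is only the guess stated above, not a documented rule.

```python
# Alphabet quoted in the question: digits, consonants, underscore (32 symbols).
MID_ALPHABET = "0123456789bcdfghjklmnpqrstvwxyz_"
MID_PREFIX = "/m/0"  # assumption: treated as a fixed prefix


def decode_mid(mid: str) -> int:
    """Read the characters after the prefix as a big-endian base-32 number."""
    if not mid.startswith(MID_PREFIX):
        raise ValueError(f"expected prefix {MID_PREFIX!r}, got {mid!r}")
    value = 0
    for ch in mid[len(MID_PREFIX):]:
        # str.index raises ValueError for characters outside the alphabet.
        value = value * len(MID_ALPHABET) + MID_ALPHABET.index(ch)
    return value


def encode_mid(value: int) -> str:
    """Inverse of decode_mid for non-negative integers (no zero padding)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    digits = []
    while True:
        value, remainder = divmod(value, len(MID_ALPHABET))
        digits.append(MID_ALPHABET[remainder])
        if value == 0:
            break
    return MID_PREFIX + "".join(reversed(digits))


print(decode_mid("/m/04swd"))  # 156524
print(encode_mid(156524))      # /m/04swd
```

The round trip reproduces the example pair, which supports the base-32 reading for this one ID. It does not show whether real IDs are ever zero-padded, so `encode_mid` may not reproduce every existing ID.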

I am stuck with /g/ IDs though. I built a plausible alphabet from the IDs I have seen, "0123456789bcdfghjklmnpqrstvwxyz_", but I cannot get it to work.

Google does some of this conversion itself, so I have one real example: /g/11b6377dzp <-> 576462201963131861

from this: Google Search

But I still cannot figure this out.

I am mostly interested in the process of getting a handle on this kind of reverse engineering problem (and of course in the result). Any ideas?


Solution

  • You provided the same alphabet for both cases, but your question implies that they are different. That aside, here's a description of the two encoding schemes.

    Quoting from the Freebase developer wiki, here's the encoding for a machine ID:

    The keys of machine-generated ids are short variable-length sequences of characters consisting of digits, lower-case letters excluding vowels, and underscore. ... (By avoiding vowels, we hope to avoid accidently [sic] generating offensive identifiers.) Mids are also URL-safe, i.e. they don't require any escaping or unescaping to be used in URLs.

    The Google Knowledge Graph IDs are in a separate namespace with the prefix "/g/1", as you noticed, and their format, according to the relevant Wikidata property page, is

    \/g\/1[0-9a-np-z][0-9a-np-z_]{6,8}
    

    so the radix varies by position (the first character after the prefix cannot be an underscore), and they chose to exclude only the confusable letter 'o' rather than all vowels, apparently preferring a larger encoding space despite the risk of "naughty words."
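
As a hedged illustration, the pattern above can be used to check candidate /g/ IDs and to make the per-position radix explicit. This is only a sketch: `is_gid`, `FIRST_SYMBOLS` and `REST_SYMBOLS` are hypothetical names, and the pattern describes the shape of an ID, not how it maps to a number.

```python
import re
import string

# The Wikidata format pattern quoted above ("/" needs no escaping in Python).
GID_PATTERN = re.compile(r"/g/1[0-9a-np-z][0-9a-np-z_]{6,8}")

# Symbol sets implied by the two character classes.
FIRST_SYMBOLS = string.digits + string.ascii_lowercase.replace("o", "")
REST_SYMBOLS = FIRST_SYMBOLS + "_"


def is_gid(candidate: str) -> bool:
    """Return True if the whole string has the shape of a /g/ ID."""
    return GID_PATTERN.fullmatch(candidate) is not None


print(len(FIRST_SYMBOLS), len(REST_SYMBOLS))  # 35 36
print(is_gid("/g/11b6377dzp"))  # True: the example from the question
print(is_gid("/g/1_b6377dzp"))  # False: underscore in the first position
print(is_gid("/g/11o6377dzp"))  # False: the letter 'o' is excluded
```

So the first position after "/g/1" draws from 35 symbols and the remaining positions from 36, which is why the 32-character alphabet from the question cannot cover these IDs.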