schema.org, microdata

What semantic web identifier should I use?


Mozilla (MDN) defines the itemid attribute as:

The itemid global attribute provides microdata in the form of a unique, global identifier of an item.

Is that identifier meant to be unique among the web pages of the same website, across the entire World Wide Web, or across just about the entire world?

If so, what is the difference with the identifier property?

Additional context:

For clarification, I will explain the research I did and what I found unclear.

In its definition and background notes, the identifier property mentions some caveats. I think they mean that identifier should not be used when the specific subtype defines a more precise identifier property, but I did not fully understand them.

Sometimes, a Thing seems to be identified by the itemid property defined as a URL:

<meta itemscope itemprop="mainEntityOfPage" itemtype="https://schema.org/WebPage" itemid="https://google.com/article"/>

In that case, it also serves as a URL, which makes things even more confusing, because Schema.org suggests the following code to link to a page describing a Thing:

<div itemscope itemtype="https://schema.org/Person">
  <a href="alice.html" itemprop="url">Alice Jones</a>
</div>

Apparently, url never serves as an identifier. Does that mean that, whenever relevant, I should prefer a combination of mainEntityOfPage and itemid instead of url, and that url should only be used for links related to the Thing, but never for the Thing’s main page?


Solution

  • While Microdata is not an RDF serialization, it’s close to one (see also: Microdata to RDF), and explaining this in RDF terms (using the Turtle serialization for examples) might be easier. You can convert Microdata snippets to Turtle with, for example, Gregg Kellogg’s RDF Distiller.

    With itemid

    An RDF triple consists of a subject, a predicate, and an object. For example:

    <http://dbpedia.org/resource/The_Lord_of_the_Rings> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/Book> .
    

    Or with prefixed names:

    @prefix dbr: <http://dbpedia.org/resource/> .
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix schema: <https://schema.org/> .
    
    dbr:The_Lord_of_the_Rings rdf:type schema:Book .
    

    In Microdata, this triple could be encoded like this:

    <div itemid="http://dbpedia.org/resource/The_Lord_of_the_Rings" itemscope itemtype="https://schema.org/Book">
    </div>
    

    What this means: The thing with the IRI http://dbpedia.org/resource/The_Lord_of_the_Rings is a book.

    Without itemid

    Without itemid, you would have a blank node as subject:

    [] rdf:type schema:Book .

    In Microdata:

    <div itemscope itemtype="https://schema.org/Book">
    </div>
    

    What this means: something is a book / a book exists.

    About the subject IRI

    Is that identifier meant to be unique among the web pages of the same website, across the entire World Wide Web, or across just about the entire world?

    • The IRI has to be universally unique.

    • The IRI doesn’t have to be an HTTP/HTTPS IRI. Even if it is an HTTP/HTTPS IRI, it doesn’t have to be resolvable on the Web.

    • The IRI has to represent the actual thing, not merely a document about that thing (unless you want to say something about that very document, of course).

      For example, the first IRI represents the intellectual creation of Tolkien, while the second and third IRIs represent documents about his intellectual creation:

      http://dbpedia.org/resource/The_Lord_of_the_Rings
      http://dbpedia.org/page/The_Lord_of_the_Rings
      https://en.wikipedia.org/wiki/The_Lord_of_the_Rings
      

      Saying that https://en.wikipedia.org/wiki/The_Lord_of_the_Rings is a schema:Book would be semantically wrong.

      If you reuse existing IRIs, you have to make sure to use them according to their definition.
      If you mint your own IRIs (under your own domain), you have to make sure to "reserve" them, so that their meaning (= what they represent) doesn’t change.
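
    As a rough Microdata sketch of this distinction: itemid carries the IRI of the book itself, while the document about the book is attached as a property value. Using sameAs for the Wikipedia page is an assumption on my part, based on how Schema.org describes that property; another property may fit your case better.

    ```html
    <div itemscope itemtype="https://schema.org/Book"
         itemid="http://dbpedia.org/resource/The_Lord_of_the_Rings">
      <!-- itemid names the book itself (the subject of the triples) -->
      <span itemprop="name">The Lord of the Rings</span>
      <!-- the Wikipedia URL names a document about the book, so it is a property value -->
      <link itemprop="sameAs" href="https://en.wikipedia.org/wiki/The_Lord_of_the_Rings">
    </div>
    ```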

    Schema.org’s identifier and url properties

    what is the difference with the identifier property?

    While the identifier property could also hold the subject IRI as value, it can of course hold any other kind of identifier as well, and not all of them are IRIs.

    For example, if you sell a product in your webshop, you might want to provide its product ID as a string. And as the IRI https://example.com/products/555#this represents the actual product (instead of the product page), you might want to provide the URL to the product page:

    <https://example.com/products/555#this> 
      rdf:type schema:Product ;
      schema:identifier "555" ; # better use a 'schema:PropertyValue' item here
      schema:url <https://example.com/products/555> .
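
    In Microdata, the same statements could be written roughly like this. It is only a sketch: using meta and link elements assumes the values are not visible text on your page.

    ```html
    <div itemscope itemtype="https://schema.org/Product"
         itemid="https://example.com/products/555#this">
      <!-- plain string identifier; a schema:PropertyValue item would be more precise -->
      <meta itemprop="identifier" content="555">
      <!-- URL of the product page, i.e. the document about the product -->
      <link itemprop="url" href="https://example.com/products/555">
    </div>
    ```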
    

    Now, if you do all this without providing a subject IRI, you get something close in meaning, but with two drawbacks: it’s harder for data consumers to integrate your data with theirs, and neither you nor others can easily link to that thing (which is arguably one of the core features of the Semantic Web):

    [] 
      rdf:type schema:Product ;
      schema:identifier <https://example.com/products/555#this> ; # better use a 'schema:PropertyValue' item here
      schema:identifier "555" ; # better use a 'schema:PropertyValue' item here
      schema:url <https://example.com/products/555> .
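
    The comments in the Turtle snippet suggest a schema:PropertyValue item instead of a plain string. A minimal Microdata sketch of that variant, still without a subject IRI, might look as follows. The propertyID value "internal-product-id" is a made-up label for illustration only.

    ```html
    <div itemscope itemtype="https://schema.org/Product">
      <!-- no itemid here, so the product is a blank node -->
      <div itemprop="identifier" itemscope itemtype="https://schema.org/PropertyValue">
        <!-- hypothetical label saying what kind of identifier this is -->
        <meta itemprop="propertyID" content="internal-product-id">
        <meta itemprop="value" content="555">
      </div>
      <link itemprop="url" href="https://example.com/products/555">
    </div>
    ```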
    

    If it’s possible for you to provide a subject IRI, there is no good reason not to do it.