rdf · linked-data · knowledge-graph

The process of creating knowledge graphs


So I am new to the world of the semantic web, RDF, and ontologies. How does the process of creating knowledge graphs work? Say I want to create a knowledge graph about a specific team and link everything about it: the players, the trophies, and so on. How would that go? Do I first scrape data about the team? Do I convert from CSV to RDF triples? And where do data science, NLP, and machine learning fall into all this?


Solution

  • OK, there are a few components to this. I will take each in turn.

    Part 1:

    So I am new to the world of the semantic web, RDF, and ontologies. How does the process of creating knowledge graphs work? Say I want to create a knowledge graph about a specific team and link everything about it: the players, the trophies, and so on. How would that go?

    Some high-level steps:

    1. Design an ontology to represent the knowledge in your knowledge graph. The ontology defines the classes, which are then populated with instances. In your case a class could be Player, and an instance could be a particular player on your team. The Player class could be linked to a Trophy class to show which players have won which trophies. This guide might prove useful.
    2. Procure data to populate your ontology. I don't have domain knowledge of your area, but web data sounds like it could work.
    3. Find an appropriate database to store your graph. Based on the tags, it sounds like you want to use RDF; Virtuoso, GraphDB, and MarkLogic all offer free versions you can run locally.
    4. Ingest your data. CRUD operations on RDF graphs can be executed using SPARQL; take a look at the SPARQL INSERT operation. There are also more complex frameworks (such as the W3C's R2RML mapping language) for turning tabular data into knowledge graphs.

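    To make step 1 concrete, here is a minimal ontology sketch in Turtle. The `ex:` namespace and the class and property names are invented for illustration, not an established vocabulary:

    ```turtle
    @prefix ex:   <http://example.org/football#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .

    # Classes to populate with instances
    ex:Team    a owl:Class .
    ex:Player  a owl:Class .
    ex:Trophy  a owl:Class .

    # Properties linking the classes together
    ex:playsFor a owl:ObjectProperty ;
        rdfs:domain ex:Player ;
        rdfs:range  ex:Team .

    ex:hasWon a owl:ObjectProperty ;
        rdfs:domain ex:Player ;
        rdfs:range  ex:Trophy .
    ```

    In a real project you would reuse existing vocabularies where possible rather than minting all of these terms yourself.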
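    And a sketch of step 4: a SPARQL INSERT DATA update that adds one player instance to the graph. Again, every name in the hypothetical `ex:` namespace is invented for illustration; substitute your own IRIs:

    ```sparql
    PREFIX ex: <http://example.org/football#>

    # Add a player, link her to a team, and record a trophy win
    INSERT DATA {
      ex:JaneDoe a ex:Player ;
                 ex:playsFor ex:ExampleFC ;
                 ex:hasWon   ex:LeagueCup2023 .
    }
    ```

    You would send this as an update request to your triple store's SPARQL endpoint.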
    However, given your use case, I would ignore everything I've written above, as this sounds like a solved problem. See, the beauty of RDF is that there is a big community of open data and shared ontologies. The graph you want to create could likely be at least partially sourced from existing public graphs, which already aggregate and crowd-source data from the web.

    See, for example, the public SPARQL endpoints of graphs such as Wikidata and DBpedia.
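    As a sketch, Wikidata's public endpoint (https://query.wikidata.org/sparql) can list the members of a team via property P54 ("member of sports team"). The team ID below is a placeholder; replace it with your team's actual Wikidata ID, which you can find by searching wikidata.org:

    ```sparql
    # Replace Q_TEAM_ID with your team's Wikidata entity ID
    SELECT ?player ?playerLabel WHERE {
      ?player wdt:P54 wd:Q_TEAM_ID .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    LIMIT 50
    ```

    The label service clause asks Wikidata to resolve each entity to a human-readable English label.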

    Part 2

    Do I first scrape data about the team? Do I convert from CSV to RDF triples?

    I would avoid scraping if you can, and instead rely on the public graphs described above. However, scraping is an option if required.

    Part 3

    And where do Data Science, NLP and Machine Learning fall into all this?

    Increasingly, knowledge graphs are being used as part of machine learning workflows. There are a few reasons for this:

    • Graphs provide a rich and highly connected web of data. Having more context generally results in better models, as the feature variables are richer.
    • Data in a graph can be extracted at a specified granularity, so it is possible to solve a variety of downstream use cases whilst retaining semantic meaning.
    • The rise of models trained using graph neural networks is fuelling the increasing adoption of knowledge graphs.
    • Modern machine learning requires increasing amounts of data, the likes of which can only be found on the web. RDF has a long history of aggregating web data in public knowledge graphs.