How to automate the creation of classes in the ontology?

I have a CSV file with the following data (a small extract):

ITEM_ID FAMILY      SUBFAMILY
555     Adventure   Adventure and extreme sports
444     Nightlife   International restaurants
333     Adventure   Adventure and extreme sports

I also have an ontology in OWL format that I created in Protégé. I know that it is possible to load a CSV file into the ontology if all the classes already exist (i.e. "Adventure", "Nightlife", etc. from FAMILY, and "Adventure and extreme sports", "International restaurants", etc. from SUBFAMILY). To do this, I can use SPARQL to load the items (ITEM_ID) as instances.

However, my question is whether I can also automate the creation of the classes using SPARQL and the CSV. The idea is to avoid manually creating thousands of classes based on the FAMILY and SUBFAMILY values stored in the CSV.

Solution

There are indeed many ways to do that. Here are a couple I have used so far:

1. OpenRefine with RDF plug-in

For one-off exercises, my preferred option is to use OpenRefine (formerly Google Refine).

You can import your ontology, along with others, and use their terms to give meaning to the data. First, you choose your root node. If you don't have a unique ID, you can generate one in an additional column. The root node has to be treated as a URI, and you can type it (assign one or more classes from your own and other ontologies). Then you choose which properties from your ontology should be mapped to the headers of the CSV. All cells of each column automatically become the object of the triple pattern you modelled, with the type you have chosen, and are treated as a URI, text, date, etc.; there are a good number of options to choose from. Finally, you export the result as RDF/XML or Turtle.
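To make the target of such a mapping concrete, here is a minimal Python sketch that produces the same kind of Turtle output by script. It is not how OpenRefine works internally, just an illustration under several assumptions of mine: the file is comma-delimited with the header ITEM_ID,FAMILY,SUBFAMILY, each SUBFAMILY is modelled as a subclass of its FAMILY, each item is an individual of its SUBFAMILY, and the namespace `http://example.org/onto#` and the helper names are hypothetical.

```python
import csv
import io
import re

# Hypothetical namespace; replace it with the IRI of your own ontology.
BASE_IRI = "http://example.org/onto#"

PREFIXES = (
    f"@prefix : <{BASE_IRI}> .\n"
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
)


def local_name(label):
    """Collapse every run of non-alphanumeric ASCII characters into '_'."""
    return re.sub(r"[^A-Za-z0-9]+", "_", label.strip()).strip("_")


def literal(text):
    """Quote a string as a Turtle literal, escaping backslashes, quotes and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def csv_to_turtle(csv_text):
    """Build class and individual declarations from comma-delimited CSV text."""
    families = set()
    subfamilies = set()  # (subfamily, family) pairs
    items = set()        # (item id, subfamily) pairs
    for row in csv.DictReader(io.StringIO(csv_text)):
        family = row["FAMILY"].strip()
        subfamily = row["SUBFAMILY"].strip()
        families.add(family)
        subfamilies.add((subfamily, family))
        items.add((row["ITEM_ID"].strip(), subfamily))

    out = [PREFIXES]
    # One owl:Class per distinct FAMILY value.
    for family in sorted(families):
        out.append(f":{local_name(family)} a owl:Class ; rdfs:label {literal(family)} .")
    # One owl:Class per distinct SUBFAMILY value (assumed to be a subclass of its FAMILY).
    for subfamily, family in sorted(subfamilies):
        out.append(
            f":{local_name(subfamily)} a owl:Class ; "
            f"rdfs:subClassOf :{local_name(family)} ; rdfs:label {literal(subfamily)} ."
        )
    # One individual per ITEM_ID, typed with its SUBFAMILY class.
    for item_id, subfamily in sorted(items):
        out.append(
            f":item_{local_name(item_id)} a owl:NamedIndividual , :{local_name(subfamily)} ."
        )
    return "\n".join(out) + "\n"


SAMPLE = """ITEM_ID,FAMILY,SUBFAMILY
555,Adventure,Adventure and extreme sports
444,Nightlife,International restaurants
333,Adventure,Adventure and extreme sports
"""

if __name__ == "__main__":
    print(csv_to_turtle(SAMPLE))
```

Duplicate FAMILY and SUBFAMILY values collapse into a single class declaration because they are collected in sets before anything is written. Note that `local_name` replaces non-ASCII characters with underscores, so labels in other scripts would need a different naming scheme.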

If your CSV file is very big, you can increase the memory available to OpenRefine. So far I have managed to convert CSVs with about half a million rows; because there were quite a few columns, the resulting file had a huge number of triples.

However, if you are working with large CSV files, Protégé will either fail to open the resulting files or work extremely slowly.

2. Virtuoso CSV spongers

There are several options here, including OpenLink Data Spaces and R2RML.