
Column name with spaces in tarql


I am using tarql (https://github.com/tarql/tarql), which uses SPARQL syntax, to transform CSV data into RDF triples.

I have a column named "web site". How can I bind it to a variable using the BIND function? I tried several forms, but none of them works:

BIND (?web site AS ?homepage)
BIND (?"web site" AS ?homepage)
BIND (?'web site' AS ?homepage)
BIND (?web\ site AS ?homepage)

All of them lead to a parse error.


Solution

  • When you have to deal with a complicated situation like this, my suggestion is to start with an exploratory test. Let's see by example:

    Suppose your source data file is ./data/table.csv, which contains:

    main index;web site;title, to translate
    1;"ciao.ronda.com";"this is the first"
    2;"miao.ronda.it";"this is the second"
    3;"bao.ronda.uk";"this is the third"
    

    step1: exploratory test query qstar.sparql:

    SELECT *
      FROM <file:table.csv#delimiter=%3B;>
      WHERE {}
      LIMIT 100
    

    launcher example (launcher0.sh):

    #!/bin/bash -
    table=./data/table.csv
    query=./data/qstar.sparql 
    ./bin/tarql --test  --delimiter \; --header-row --verbose ${query} ${table} 
    

    result:

     $ ./launcher0.sh
    --------------------------------------------------------
    | main_index | web_site         | title,_to_translate  |
    ========================================================
    | "1"        | "ciao.ronda.com" | "this is the first"  |
    | "2"        | "miao.ronda.it"  | "this is the second" |
    | "3"        | "bao.ronda.uk"   | "this is the third"  |
    --------------------------------------------------------
    

    Good: now we know that the variable name computed for the third column with these options is title,_to_translate, and that the "web site" column from the question becomes web_site.

    step2: test whether the BIND syntax accepts the resulting variable name (title,_to_translate in our example)

    Here we need an example BIND-based query to expose the problem; suppose this is the query where we try to use our field named ?title,_to_translate:
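The mapping seen in that table can be previewed without running tarql at all. The sketch below is only an approximation based on the output above: the function name header_to_varnames is hypothetical, and it assumes the only transformation is replacing each space with an underscore, which is what the three sample columns show. It is not tarql's actual implementation.

```shell
# Hypothetical helper: print the variable name expected for each column of a
# CSV header line. It mimics only the space-to-underscore substitution that is
# visible in the tarql output above; other rules tarql may apply are ignored.
header_to_varnames() {
  header="$1"
  delimiter="${2:-;}"   # assumed default: the semicolon used in the sample file
  # Put one column per line, then replace every space with an underscore.
  printf '%s\n' "$header" | tr "$delimiter" '\n' | tr ' ' '_'
}

header_to_varnames 'main index;web site;title, to translate'
```

For the sample header this prints main_index, web_site and title,_to_translate, one per line, matching the column names in the table.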

    SELECT ?homepage ?uri ?title_with_language_tag
      WHERE {
        BIND (?web_site AS ?homepage)
        BIND (URI(CONCAT('http://website.com/ns#', ?main_index)) AS ?uri)
        BIND (STRLANG(?title,_to_translate, 'en') AS ?title_with_language_tag)
      }
    

    result:

     $ ./launcher0.sh
    com.hp.hpl.jena.query.QueryParseException: Lexical error at line 5, column 27.  Encountered: "t" (116), after : "_"
        at org.deri.tarql.TarqlParser.parse(TarqlParser.java:113)
    

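    In short, this query contains a lexical error: a comma is not a legal character in a SPARQL variable name, so the parser stops reading the variable at ?title and cannot make sense of the remaining _to_translate.

    In cases like this, rather than continuing to fight with the language, I prefer to adopt a little workaround.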
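A quick way to spot such names before writing the query is to check them against the characters a variable name may contain. The sketch below is a simplification: the function name is_ascii_sparql_varname is hypothetical, and it accepts only the ASCII subset (letters, digits, underscore), whereas the SPARQL grammar also admits many non-ASCII letters.

```shell
# Hypothetical helper: succeed only if the name consists of ASCII letters,
# digits and underscores. This is a deliberately narrow approximation of the
# SPARQL variable-name rule, which also allows many non-ASCII characters.
is_ascii_sparql_varname() {
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9_]+$'
}

# Check the three column names computed in step 1.
for name in 'main_index' 'web_site' 'title,_to_translate'; do
  if is_ascii_sparql_varname "$name"; then
    echo "?$name can be used directly"
  else
    echo "?$name needs a workaround"
  fi
done
```

Only title,_to_translate is reported as needing a workaround, because of the comma.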

    step3: solution with a little workaround

    Let's leverage the option -H / --no-header-row (CSV file has no header row; use variable names ?a, ?b, ...) and enjoy an easy solution. All we need to do is remove the header row from our source data file (an easy task that you can pipeline into the process or do in whatever way you prefer). For convenience of testing I copied the data without the first row into ./data/table0-noheader.csv.

    Now the same query becomes easier for the parser; ./data/query0.sparql:
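One possible way to drop the header row is tail. The sketch below is self-contained for illustration: it recreates the sample file in a temporary directory (the workdir variable is an assumption made for this example, not part of the original setup) and writes the headerless copy next to it.

```shell
# Recreate the sample input in a scratch directory (assumed location).
workdir=$(mktemp -d)
cat > "$workdir/table.csv" <<'EOF'
main index;web site;title, to translate
1;"ciao.ronda.com";"this is the first"
2;"miao.ronda.it";"this is the second"
3;"bao.ronda.uk";"this is the third"
EOF

# tail -n +2 prints everything from line 2 onward, i.e. it drops the header.
tail -n +2 "$workdir/table.csv" > "$workdir/table0-noheader.csv"

cat "$workdir/table0-noheader.csv"
```

The resulting file contains only the three data rows, so its columns are addressed positionally as ?a, ?b and ?c.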

    SELECT ?homepage ?uri ?title_with_language_tag
      WHERE {
        BIND (?a AS ?homepage)
        BIND (URI(CONCAT('http://website.com/ns#', ?b)) AS ?uri)
        BIND (STRLANG(?c, 'en') AS ?title_with_language_tag)
      }
    

    launcher-noheader.sh:

    #!/bin/bash -
    table=./data/table0-noheader.csv
    query=./data/query0.sparql 
    ./bin/tarql --test  --no-header-row --delimiter \; --verbose ${query} ${table} 
    

    result:

     $ ./launcher-noheader.sh 
    -------------------------------------------------------------------------------
    | homepage | uri                                    | title_with_language_tag |
    ===============================================================================
    | "1"      | <http://website.com/ns#ciao.ronda.com> | "this is the first"@en  |
    | "2"      | <http://website.com/ns#miao.ronda.it>  | "this is the second"@en |
    | "3"      | <http://website.com/ns#bao.ronda.uk>   | "this is the third"@en  |
    -------------------------------------------------------------------------------
    

    Note

    1. The reference docs, "Header row, delimiters, quotes and character encoding in CSV/TSV files", list all the possible ways and combinations to express these options; they are well worth a read.

    2. Another useful reference is "Possible names for variables in SPARQL 1.1 Query Language".