Search code examples
cassandratuplescqlshdsbulk

What is the correct CSV format for tuples when loading data with DSBulk?


I recently started using Cassandra for my new project and doing some load testing.

I have a scenario where I’m doing dsbulk load using CSV like this,

$ dsbulk load -url <csv path> -k <keyspace> -t <table> -h <host> -u <user> -p <password> -header true -cl LOCAL_QUORUM

My CSV file entries looks like this,

userid birth_year created_at                  freq
1234   1990       2023-01-13T23:27:15.563Z    {1234:{"(1, 2)": 1}}

Column types,

userid bigint PRIMARY KEY,
birth_year int,
created_at timestamp,
freq map<bigint, frozen<map<frozen<tuple<tinyint, smallint>>, smallint>>>

The issue is, for column freq, I try different ways of setting the value in csv like below, but not able to insert the row using dsbulk

  1. Let’s say if I set freq as {1234:{[1, 2]: 1}}, com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException: Could not map field freq to variable freq; conversion from Java type java.lang.String to CQL type Map(BIGINT => Map(Tuple(TINYINT, SMALLINT) => SMALLINT, not frozen), not frozen) failed for raw value: {1234:{[1,2]: 1}} Caused by: java.lang.IllegalArgumentException: Could not parse ‘{1234:{[1, 2]: 1}}’ as Json Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character (‘[’ (code 91)): was expecting either valid name character (for unquoted name) or double-quote (for quoted) to start field name at [Source: (String)“{1234:{[1, 2]: 1}}“; line: 1, column: 9]

  2. If I set freq as {\"1234\":{\"[1, 2]\":1}},
    java.lang.IllegalArgumentException: Expecting record to contain 4 fields but found 5.

  3. If I set freq as {1234:{"[1, 2]": 1}} or {1234:{"(1, 2)": 1}},
    Source: 1234,80,2023-01-13T23:27:15.563Z,“{1234:{“”[1, 2]“”: 1}}” java.lang.IllegalArgumentException: Expecting record to contain 4 fields but found 5.

But in COPY FROM TABLE command, the value for freq {1234:{[1, 2]:1}} inserts into DB without any error, the value in DB looks like this {1234: {(1, 2): 1}}

I guess the JSON not accepting array(tuple) as key when I try with dsbulk? Can someone advise me what’s the issue and how to fix this? Appreciate your help.


Solution

  • When loading data using the DataStax Bulk Loader (DSBulk), the CSV format for CQL tuple type is different from the format used by the COPY ... FROM command because DSBulk uses a different parser.

    Formatting the CSV data is particularly challenging in your case because the column contains multiple nested CQL collections.

    InvalidMappingException

    The JSON parser used by DSBulk doesn't accept parentheses () when enclosing tuples. It also expects tuples to be enclosed in double quotes " otherwise you'll get errors like:

    com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException: \
      Could not map field ... to variable ...; \
      conversion from Java type ... to CQL type ... failed for raw value: ...
       ...
    Caused by: java.lang.IllegalArgumentException: Could not parse '...' as Json
       ...
    Caused by: com.fasterxml.jackson.core.JsonParseException: \
      Unexpected character ('(' (code 91)): was expecting either valid name character \
      (for unquoted name) or double-quote (for quoted) to start field name
       ...
    

    IllegalArgumentException

    Since values for tuples contain a comma (,) as a separator, DSBulk incorrectly parses the rows and it thinks each row contains more fields than expected and throws an IllegalArgumentException, for example:

    java.lang.IllegalArgumentException: Expecting record to contain 2 fields but found 3.
    

    Solution

    Just to make it easier, here is the schema for the table I'm using as an example:

    CREATE TABLE inttuples (
        id int PRIMARY KEY,
        inttuple map<frozen<tuple<tinyint, smallint>>, smallint>
    )
    

    In this example CSV file, I've used the pipe character (|) as a delimiter:

    id|inttuple
    1|{"[2,3]":4}
    

    Here's another example that uses tabs as the delimiter:

    id      inttuple
    1       {"[2,3]":4}
    

    Note that you will need to specify the delimiter with either -delim '|' or -delim '\t' when running DSBulk. Cheers!


    👉 Please support the Apache Cassandra community by hovering over the tag then click on the Watch tag button. 🙏 Thanks!