Search code examples
apache-nifi

How do I configure a NiFi schema to convert all properties to strings when converting from CSV to JSON?


I have flow files with CSV (pipe-delimited) content that I'm converting to JSON. For the benefit of some later processing, I'd like all the JSON properties to be string values. How can I configure either the CSVReader or JSONRecordSetWriter to always output strings?

The inferred schema makes type decisions based on the values that it sees. The CSV files come from different sources with different fields, so I'm trying to avoid having to enumerate all the possible schemas. (I get that if I did that, I could specify type "string".) Is there a way to say that all properties should be strings?


Solution

  • Summary: CSVReader with Schema Access Strategy "Infer Schema" may create a schema with numeric types. CSVReader with Schema Access Strategy "Use String Fields From Header" creates a schema where all fields are string fields. In either case, the field names come from the first row.

    Documentation

    Documentation is at CSVReader Properties table, in the Schema Access Strategy row.

    For "Infer Schema", hovering mouse pointer over its (?) icon shows

    The Schema of the data will be inferred automatically when the data is read. See ... "Additional Details" for information about how the schema is inferred.

    For "Use String Fields From Header", hovering mouse pointer over its (?) icon shows

    The first non-comment line of the CSV file is a header line that contains the names of the columns. The schema will be derived by using the column names in the header and assuming that all column names are of type String.

    Verifying (with Nifi 1.16.3):

    Input file

    id|version|date|time|timestamp|phase
    123456|12.0|2019-12-28|23:58|2019-12-28T23:58:57.000Z|alpha
    123465|12.1|2019-12-29|23:59|2019-12-29T23:59:58.000Z|beta
    

    Flow

    GetFile -success-> ConvertRecord -success-> PutFile -success-> LogAttribute

    ConvertRecord Properties

    Record Reader CSVReader
    Record Writer JsonRecordSetWriter

    JSONRecordSetWriter Properties

    Schema Write Strategy Set 'avro.schema' Attribute
    Schema Access Strategy Inherit Record Schema
    Date Format (No value set)
    Time Format (No value set)
    Timestamp Format (No value set)
    Pretty Print JSON true

    AttributeLog Properties

    Attributes to Log avro.schema

    1. With Schema Access Strategy: "Infer Schema" (and formats set), ...

    CSVReader Properties

    Schema Access Strategy Infer Schema
    Date Format yyyy-MM-dd
    Time Format HH:mm
    Timestamp Format yyyy-MM-dd'T'HH:mm:ss.SSSX
    CSV Format Custom Format
    Value Separator |
    Record Separator \n
    Treat First Line as Header true

    the output JSON values contain unquoted numbers, and dates and times are unquoted integers (timestamp must parse but is kept as string), ...

    [ {
      "id" : 123456,
      "version" : 12.0,
      "date" : 1577509200000,
      "time" : 104280000,
      "timestamp" : "2019-12-28T23:58:57.000Z",
      "phase" : "alpha"
    }, {
      "id" : 123465,
      "version" : 12.1,
      "date" : 1577595600000,
      "time" : 104340000,
      "timestamp" : "2019-12-29T23:59:58.000Z",
      "phase" : "beta"
    } ]
    

    and the log shows that the avro.schema contains nullable numeric types for some columns. (manually prettified)

    ... "fields":[{"name":"id",       "type":["null","int"]},
                  {"name":"version",  "type":["null","float"]},
                  {"name":"date",     "type":["null",{"type":"int","logicalType":"date"}]},
                  {"name":"time",     "type":["null",{"type":"int","logicalType":"time-millis"}]},
                  {"name":"timestamp","type":["null","string"]},
                  {"name":"phase",    "type":["null","string"]}]...
    

    2. With Schema Access Strategy: "Use String Fields From Header", ...

    CSVReader Properties

    Schema Access Strategy Use String Fields From Header
    Date Format (No value set)
    Time Format (No value set)
    Timestamp Format (No value set)
    CSV Format Custom Format
    Value Separator |
    Record Separator \n
    Treat First Line as Header true

    the output JSON values are in quoted strings as desired, ...

    [ {
      "id" : "123456",
      "version" : "12.0",
      "date" : "2019-12-28",
      "time" : "23:58",
      "timestamp" : "2019-12-28T23:58:57.000Z",
      "phase" : "alpha"
    }, {
      "id" : "123465",
      "version" : "12.1",
      "date" : "2019-12-29",
      "time" : "23:59",
      "timestamp" : "2019-12-29T23:59:58.000Z",
      "phase" : "beta"
    } ]
    

    and the log shows the avro.schema contains nullable string types for each column. (manually prettified)

    ... "fields":[{"name":"id",       "type":["null","string"]},
                  {"name":"version",  "type":["null","string"]},
                  {"name":"date",     "type":["null","string"]},
                  {"name":"time",     "type":["null","string"]},
                  {"name":"timestamp","type":["null","string"]},
                  {"name":"phase",    "type":["null","string"]}]...