How do I configure a NiFi schema to convert all properties to strings when converting from CSV to JSON?

I have flow files with CSV (pipe-delimited) content that I'm converting to JSON. For the benefit of some later processing, I'd like all the JSON properties to be string values. How can I configure either the CSVReader or JSONRecordSetWriter to always output strings?

The inferred schema makes type decisions based on the values that it sees. The CSV files come from different sources with different fields, so I'm trying to avoid having to enumerate all the possible schemas. (I get that if I did that, I could specify type "string".) Is there a way to say that all properties should be strings?

Solution

Summary: CSVReader with Schema Access Strategy "Infer Schema" may create a schema with numeric types. CSVReader with Schema Access Strategy "Use String Fields From Header" creates a schema where all fields are string fields. In either case, the field names come from the first row.

Documentation

Documentation is at CSVReader Properties table, in the Schema Access Strategy row.

For "Infer Schema", hovering mouse pointer over its (?) icon shows

The Schema of the data will be inferred automatically when the data is read. See ... "Additional Details" for information about how the schema is inferred.

For "Use String Fields From Header", hovering mouse pointer over its (?) icon shows

The first non-comment line of the CSV file is a header line that contains the names of the columns. The schema will be derived by using the column names in the header and assuming that all column names are of type String.

Verifying (with Nifi 1.16.3):

Input file

id|version|date|time|timestamp|phase
123456|12.0|2019-12-28|23:58|2019-12-28T23:58:57.000Z|alpha
123465|12.1|2019-12-29|23:59|2019-12-29T23:59:58.000Z|beta

Flow

GetFile -success-> ConvertRecord -success-> PutFile -success-> LogAttribute

ConvertRecord Properties


Record Reader	CSVReader
Record Writer	JsonRecordSetWriter

JSONRecordSetWriter Properties


Schema Write Strategy	Set 'avro.schema' Attribute
Schema Access Strategy	Inherit Record Schema
Date Format	(No value set)
Time Format	(No value set)
Timestamp Format	(No value set)
Pretty Print JSON	true

AttributeLog Properties


Attributes to Log	avro.schema

1. With Schema Access Strategy: "Infer Schema" (and formats set), ...

CSVReader Properties


Schema Access Strategy	Infer Schema
Date Format	yyyy-MM-dd
Time Format	HH:mm
Timestamp Format	yyyy-MM-dd'T'HH:mm:ss.SSSX
CSV Format	Custom Format
Value Separator	\|
Record Separator	\n
Treat First Line as Header	true

the output JSON values contain unquoted numbers, and dates and times are unquoted integers (timestamp must parse but is kept as string), ...

[ {
  "id" : 123456,
  "version" : 12.0,
  "date" : 1577509200000,
  "time" : 104280000,
  "timestamp" : "2019-12-28T23:58:57.000Z",
  "phase" : "alpha"
}, {
  "id" : 123465,
  "version" : 12.1,
  "date" : 1577595600000,
  "time" : 104340000,
  "timestamp" : "2019-12-29T23:59:58.000Z",
  "phase" : "beta"
} ]

and the log shows that the avro.schema contains nullable numeric types for some columns. (manually prettified)

... "fields":[{"name":"id",       "type":["null","int"]},
              {"name":"version",  "type":["null","float"]},
              {"name":"date",     "type":["null",{"type":"int","logicalType":"date"}]},
              {"name":"time",     "type":["null",{"type":"int","logicalType":"time-millis"}]},
              {"name":"timestamp","type":["null","string"]},
              {"name":"phase",    "type":["null","string"]}]...

2. With Schema Access Strategy: "Use String Fields From Header", ...

CSVReader Properties


Schema Access Strategy	Use String Fields From Header
Date Format	(No value set)
Time Format	(No value set)
Timestamp Format	(No value set)
CSV Format	Custom Format
Value Separator	\|
Record Separator	\n
Treat First Line as Header	true

the output JSON values are in quoted strings as desired, ...

[ {
  "id" : "123456",
  "version" : "12.0",
  "date" : "2019-12-28",
  "time" : "23:58",
  "timestamp" : "2019-12-28T23:58:57.000Z",
  "phase" : "alpha"
}, {
  "id" : "123465",
  "version" : "12.1",
  "date" : "2019-12-29",
  "time" : "23:59",
  "timestamp" : "2019-12-29T23:59:58.000Z",
  "phase" : "beta"
} ]

and the log shows the avro.schema contains nullable string types for each column. (manually prettified)

... "fields":[{"name":"id",       "type":["null","string"]},
              {"name":"version",  "type":["null","string"]},
              {"name":"date",     "type":["null","string"]},
              {"name":"time",     "type":["null","string"]},
              {"name":"timestamp","type":["null","string"]},
              {"name":"phase",    "type":["null","string"]}]...