I have flow files with CSV (pipe-delimited) content that I'm converting to JSON. For the benefit of some later processing, I'd like all the JSON properties to be string values. How can I configure either the CSVReader or JSONRecordSetWriter to always output strings?
The inferred schema makes type decisions based on the values that it sees. The CSV files come from different sources with different fields, so I'm trying to avoid having to enumerate all the possible schemas. (I get that if I did that, I could specify type "string".) Is there a way to say that all properties should be strings?
Summary: CSVReader with Schema Access Strategy "Infer Schema" may create a schema with numeric types. CSVReader with Schema Access Strategy "Use String Fields From Header" creates a schema where all fields are string fields. In either case, the field names come from the first row.
This is documented in the CSVReader Properties table, in the Schema Access Strategy row.
For "Infer Schema", hovering the mouse pointer over its (?) icon shows:
The Schema of the data will be inferred automatically when the data is read. See ... "Additional Details" for information about how the schema is inferred.
For "Use String Fields From Header", hovering the mouse pointer over its (?) icon shows:
The first non-comment line of the CSV file is a header line that contains the names of the columns. The schema will be derived by using the column names in the header and assuming that all column names are of type String.
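To make the "all fields are strings" behaviour concrete, here is a minimal Python sketch of the same idea outside NiFi: take the field names from the header line and keep every value exactly as the text that was read. This is only an illustration, not NiFi's implementation; the function name `csv_to_string_records` is made up, and the sample rows are the ones from the input file shown further down.

```python
import csv
import io
import json

# Sample pipe-delimited content (same rows as the input file in this answer).
SAMPLE = (
    "id|version|date|time|timestamp|phase\n"
    "123456|12.0|2019-12-28|23:58|2019-12-28T23:58:57.000Z|alpha\n"
    "123465|12.1|2019-12-29|23:59|2019-12-29T23:59:58.000Z|beta\n"
)

def csv_to_string_records(text, delimiter="|"):
    """Use the first line as field names and keep every value as a string."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    # csv.DictReader performs no type conversion, so all values stay strings.
    return [dict(row) for row in reader]

records = csv_to_string_records(SAMPLE)
print(json.dumps(records, indent=2))
```

Because nothing is ever parsed as a number or date, the result does not depend on which columns a given source happens to contain.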
Input file
id|version|date|time|timestamp|phase
123456|12.0|2019-12-28|23:58|2019-12-28T23:58:57.000Z|alpha
123465|12.1|2019-12-29|23:59|2019-12-29T23:59:58.000Z|beta
Flow
GetFile -success-> ConvertRecord -success-> PutFile -success-> LogAttribute
ConvertRecord Properties
Record Reader | CSVReader |
Record Writer | JsonRecordSetWriter |
JsonRecordSetWriter Properties
Schema Write Strategy | Set 'avro.schema' Attribute |
Schema Access Strategy | Inherit Record Schema |
Date Format | (No value set) |
Time Format | (No value set) |
Timestamp Format | (No value set) |
Pretty Print JSON | true |
LogAttribute Properties
Attributes to Log | avro.schema |
CSVReader Properties
Schema Access Strategy | Infer Schema |
Date Format | yyyy-MM-dd |
Time Format | HH:mm |
Timestamp Format | yyyy-MM-dd'T'HH:mm:ss.SSSX |
CSV Format | Custom Format |
Value Separator | | |
Record Separator | \n |
Treat First Line as Header | true |
With these CSVReader properties, the output JSON values contain unquoted numbers, and dates and times are unquoted integers (the timestamp presumably parses, but is kept as a string):
[ {
"id" : 123456,
"version" : 12.0,
"date" : 1577509200000,
"time" : 104280000,
"timestamp" : "2019-12-28T23:58:57.000Z",
"phase" : "alpha"
}, {
"id" : 123465,
"version" : 12.1,
"date" : 1577595600000,
"time" : 104340000,
"timestamp" : "2019-12-29T23:59:58.000Z",
"phase" : "beta"
} ]
The log shows that avro.schema contains nullable numeric types for some columns (manually prettified):
... "fields":[{"name":"id", "type":["null","int"]},
{"name":"version", "type":["null","float"]},
{"name":"date", "type":["null",{"type":"int","logicalType":"date"}]},
{"name":"time", "type":["null",{"type":"int","logicalType":"time-millis"}]},
{"name":"timestamp","type":["null","string"]},
{"name":"phase", "type":["null","string"]}]...
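The integer "date" and "time" values in the Infer Schema output above look like milliseconds since the Unix epoch. A small Python sketch to decode them, assuming that interpretation (the helper name `from_epoch_millis` is made up for illustration):

```python
from datetime import datetime, timezone

def from_epoch_millis(ms):
    """Interpret an integer as milliseconds since 1970-01-01T00:00:00Z."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# Values taken from the first record of the Infer Schema output.
date_value = from_epoch_millis(1577509200000)
time_value = from_epoch_millis(104280000)

print(date_value.isoformat())  # 2019-12-28T05:00:00+00:00
print(time_value.isoformat())  # 1970-01-02T04:58:00+00:00
```

Both results are five hours later than the source values (2019-12-28 and 23:58). That would be consistent with the values having been parsed in a local time zone five hours behind UTC, but this is an inference from the numbers only. Either way, it shows why the inferred schema is awkward when later processing expects the original text.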
CSVReader Properties
Schema Access Strategy | Use String Fields From Header |
Date Format | (No value set) |
Time Format | (No value set) |
Timestamp Format | (No value set) |
CSV Format | Custom Format |
Value Separator | | |
Record Separator | \n |
Treat First Line as Header | true |
With these CSVReader properties, the output JSON values are quoted strings, as desired:
[ {
"id" : "123456",
"version" : "12.0",
"date" : "2019-12-28",
"time" : "23:58",
"timestamp" : "2019-12-28T23:58:57.000Z",
"phase" : "alpha"
}, {
"id" : "123465",
"version" : "12.1",
"date" : "2019-12-29",
"time" : "23:59",
"timestamp" : "2019-12-29T23:59:58.000Z",
"phase" : "beta"
} ]
The log shows that avro.schema contains nullable string types for every column (manually prettified):
... "fields":[{"name":"id", "type":["null","string"]},
{"name":"version", "type":["null","string"]},
{"name":"date", "type":["null","string"]},
{"name":"time", "type":["null","string"]},
{"name":"timestamp","type":["null","string"]},
{"name":"phase", "type":["null","string"]}]...
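If the later processing depends on every property being a string, a quick check on the converted JSON may be worthwhile. The following is a minimal Python sketch, independent of NiFi; the function name `non_string_fields` and the two shortened sample documents are made up for illustration.

```python
import json

def non_string_fields(json_text):
    """Return (record index, field name, type name) for every non-string value."""
    records = json.loads(json_text)
    return [
        (index, name, type(value).__name__)
        for index, record in enumerate(records)
        for name, value in record.items()
        if not isinstance(value, str)
    ]

# Shortened versions of the two kinds of output shown in this answer.
inferred = '[{"id": 123456, "version": 12.0, "phase": "alpha"}]'
all_strings = '[{"id": "123456", "version": "12.0", "phase": "alpha"}]'

print(non_string_fields(inferred))     # [(0, 'id', 'int'), (0, 'version', 'float')]
print(non_string_fields(all_strings))  # []
```

An empty list means every property in every record is a string.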