Brief high-level overview: trying to go Oracle --> NiFi --> Solr.
I have been trying to get data out of an Oracle view into NiFi as JSON for ultimate streaming into Solr. For ordinary columns (varchar2, integer, etc.) this is not a problem. A simple select * from view issued by an ExecuteSQLRecord processor works great; the data in NiFi is properly formed JSON and is streamed into Solr perfectly.
The problem is, I have some columns that represent 1:N data, i.e., arrays. I am replacing an outdated process, swapping an obsolete ETL tool out for NiFi. The way this used to work for these 1:N columns was that the old ETL tool would create XML in this format:
<results>
  <row>
    <field name="town">sleepy town</field>
  </row>
  <row>
    <field name="town">boring town</field>
  </row>
</results>
Apparently, there was some magic in the built-in Solr component in this old ETL tool where this XML format was transformed so that it ended up looking like a JSON string array in Solr:
"array_column": [
  "sleepy town",
  "boring town"
]
I was originally creating XML in the Oracle view to mimic the old ETL tool, before I realized that this magical transformation was going away. Since all the other ordinary columns (varchar2, integer, etc.) were coming through NiFi and into Solr properly, I thought I would reformat these 1:N columns so that they looked like JSON arrays, which I did.
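Roughly, that reformatting looked like this (table and column names simplified here; json_arrayagg is one way to build the string):
select p.case_id,
       (select json_arrayagg(c.town)
        from   child_table c
        where  c.case_id = p.case_id) as array_column
from parent_table p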
The problem is, no matter what I do in NiFi (or in the view in Oracle), NiFi will only treat the data coming in from Oracle as a monolithic string. In other words, what the view returns in Oracle is this:
select array_column from view
ARRAY_COLUMN
------------
["sleepy_town","boring town"]
But in NiFi, when looking at the data provenance for the ExecuteSQLRecord processor, this is what it shows:
[
  {
    "SOME_ORDINARY_COLUMNS": "Everything looks great!",
    "ARRAY_COLUMN": "[\"sleepy town\",\"boring town\"]",
    "MORE_ORDINARY_COLUMNS": "Everything looks great!"
  }
]
and when I get Avro involved, it looks like this:
{
"SOME_ORDINARY_COLUMNS" : {
"string" : "Everything looks great!"
},
"ARRAY_COLUMN" : {
"string" : "[\"sleepy town\",\"boring town\"]"
},
"MORE_ORDINARY_COLUMNS" : {
"string" : "Everything looks great!"
}
}
I have tried many things: select JSON_OBJECT(...) instead of select * (and I also tried dot notation with the JSON_OBJECT as a subquery). All of these either resulted in a "can't convert string to array" error in NiFi, or the 1:N column remained a monolithic string with the double-quotes escaped.
I'm sure I've tried other variations along these lines, but I've been fighting with this for a couple of days and it's all becoming blurry. I know that in Oracle, JSON_OBJECT has a FORMAT JSON option that tells Oracle, "Hey, this IS JSON, so don't reformat it!" I wish there were something like that in NiFi, some way to tell NiFi to treat what's coming in from Oracle as properly formatted JSON that doesn't need to be monkeyed with.
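For reference, that option looks like this (my_view and the column names are placeholders for the real ones):
select json_object(
         'case_id'      value case_id,
         'array_column' value array_column format json  -- "this string is already JSON"
       ) as json_doc
from my_view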
I haven't done anything with it yet, but I imagine I could get really fancy and write some Groovy or something; however, I am really trying to do this with just native NiFi processors, without having to lean on custom scripting.
I had considered using a collection earlier, and Alex Poole's first comment pushed me back in that direction; that was what got it done. I switched my view around to use nested tables:
create or replace type my_schema.clob_nested_table as table of clob;
Then in the view:
with main_query as
(
  select case_id, other_fields, more_fields, transformation1, transformation2
  from parent_table
  join lookup_table on ...               -- join conditions omitted
  join another_lookup_table_etc on ...
),
child_a as
(
  -- collect the 1:N rows for each case into a nested table
  select a.case_id,
         cast(multiset(select to_clob(nvl(b.column1, '') || nvl(b.column2, '') || nvl(b.etc, ''))
                       from child_table_a b
                       where a.case_id = b.case_id) as my_schema.clob_nested_table) as child_a_data
  from main_query a
)
select a.case_id, a.other_fields, a.more_fields, a.transformation1, a.transformation2,
       b.child_a_data
from main_query a
left outer join child_a b on a.case_id = b.case_id
Of course, that's pseudo-code, and not 100% syntactically correct, but it shows the magic. There are 54 of these "child" tables, so it's a rather lengthy view, and there's a lot of transformation going on.
The solution was to use nested tables, and the data shows up in NiFi looking like a properly formatted JSON array, and then gets sent off to Solr, where it looks exactly as it should. Thanks, Alex!
UPDATE:
A colleague ran up against a similar problem, and he ended up solving it with a Jolt transformation. So I tried that out in my flow, and it worked like magic. I had looked at Jolt transformations before, but I will admit I had never touched Jolt, and the learning curve looked steeper than I wanted to commit to.
But basically, once you have your one-to-many data comma-separated in a single column (via listagg or xmlagg), pass it through a JoltTransformJSON processor with the following specification:
[
  {
    "operation": "shift",
    "spec": { "*": "" }
  },
  {
    "operation": "modify-overwrite-beta",
    "spec": { "*": { "MyFirstCSVColumn": "=split(',',@(1,&))" } }
  },
  {
    "operation": "modify-overwrite-beta",
    "spec": { "*": { "MyNextCSVColumn": "=split(',',@(1,&))" } }
  }
]
That first shift operation catches all the "normal" columns in your table. Also, because some of my data had embedded commas, I ended up using the backtick (`) as my delimiter, both in the listagg/xmlagg and in the Jolt specification.