
Getting NiFi to read properly formatted JSON from an Oracle view without transforming/reformatting it


Brief high-level overview: trying to go Oracle --> NiFi --> Solr.

I have been trying to get data out of an Oracle view into NiFi as JSON for ultimate streaming into Solr. For ordinary columns (varchar2, integer, etc.) this is not a problem. A simple select * from view in an ExecuteSQLRecord processor works great; the data in NiFi is properly formed JSON and streams into Solr perfectly.

The problem is, I have some columns that represent 1:N data, i.e., arrays. I am replacing an outdated process, swapping NiFi for an obsolete ETL tool. The way this used to work for these 1:N columns was that the old ETL tool would create XML of this format:

<results>
    <row>
        <field name="town">sleepy town</field>
    </row>
    <row>
        <field name="town">boring town</field>
    </row>
</results>

Apparently, there was some magic in the built-in Solr component in this old ETL tool where this XML format was transformed so that it ends up looking like a JSON string array in Solr:

"array_column":
    ["sleepy town",
     "boring town"]

I was originally creating XML in the Oracle view to mimic the old ETL tool before I realized that this magical transformation was going away. Since all the other ordinary columns (varchar2, integer, etc.) were coming through NiFi and to Solr properly, I thought I would reformat these 1:N columns so that they looked like JSON arrays, which I did.

The problem is, no matter what I do in NiFi (or in the view in Oracle), NiFi will only treat the data coming in from Oracle as a monolithic string. In other words, what the view returns in Oracle is this:

select array_column from view

ARRAY_COLUMN
------------
["sleepy town","boring town"]

But in NiFi, when looking at the data provenance for the ExecuteSQLRecord processor, this is what it shows:

[
  {
    "SOME_ORDINARY_COLUMNS":"Everything looks great!",
    "ARRAY_COLUMN": "[\"sleepy town\",\"boring town\"]",
    "MORE_ORDINARY_COLUMNS":"Everything looks great!"
  }
]

and when I get Avro involved, it looks like this:

{
  "SOME_ORDINARY_COLUMNS" : {
    "string" : "Everything looks great!"
  },
  "ARRAY_COLUMN" : {
    "string" :  "[\"sleepy town\",\"boring town\"]"
  },
  "MORE_ORDINARY_COLUMNS" : {
    "string" : "Everything looks great!"
  }
}

I have tried many things:

  • Changed the Schema Write Strategy on the JsonRecordSetWriter to Set 'avro.schema' Attribute so that I could look at the schema, then used "Use 'Schema Text' Property" to try to force the ARRAY_COLUMN to be an array of strings (many variations)
  • Many variations of using an Avro writer and sending the output to a QueryRecord processor with either Avro or JSON readers, with both inferred and explicit schemas specified
  • Setting the JSONPath in the JsonPathReader for the QueryRecord processor to $.ARRAY_COLUMN[*] and/or $.ARRAY_COLUMN.[*]
  • Changing the query in ExecuteSQLRecord to select JSON_OBJECT(...) instead of select * (also tried dot notation with the JSON_OBJECT as a subquery)
  • Investigated VARRAY and nested tables in Oracle, but that seemed like the wrong direction

All of these either resulted in a "can't convert string to array" error in NiFi, or the 1:N column remaining a monolithic string with double-quotes escaped.
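For reference, the kind of schema I was supplying via the Schema Text property looked roughly like this (column names here are placeholders for my real ones):

```json
{
  "type": "record",
  "name": "nifi_record",
  "fields": [
    { "name": "SOME_ORDINARY_COLUMNS", "type": [ "null", "string" ] },
    { "name": "ARRAY_COLUMN", "type": [ "null", { "type": "array", "items": "string" } ] },
    { "name": "MORE_ORDINARY_COLUMNS", "type": [ "null", "string" ] }
  ]
}
```

Declaring ARRAY_COLUMN as an array of strings this way is what produced the "can't convert string to array" error: the reader gets a plain string value from the JDBC driver and won't coerce it into the declared array type.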

I'm sure there are other variations I've tried between the lines related to these, but I've been fighting with this for a couple of days, and it's becoming blurry. I know that in Oracle, JSON_OBJECT has a "FORMAT JSON" option that tells Oracle, "Hey, this IS JSON, so don't reformat it!" I wish there were something like that in NiFi, some way to tell NiFi to treat what's coming in from Oracle as properly formatted JSON that doesn't need to be monkeyed with.

I haven't tried it, but I imagine I could get really fancy and write some Groovy or something. However, I am really trying to do this with just native NiFi processors, without having to lean on custom scripting.


Solution

  • Although I had considered using a collection, Alex Poole's first comment pushed me back in that direction, and that was what got it done. I switched around my view to use nested tables:

    create or replace type my_schema.clob_nested_table as table of clob;

    Then in the view:

    with main_query as
    (
        select case_id, other_fields, more_fields, transformation1, transformation2
          from parent_table
          join lookup_table
          join another_lookup_table_etc
    ),
    child_a as
    (
        select a.case_id,
               cast(multiset(select to_clob(nvl(b.column1, '') || nvl(b.column2, '') || nvl(b.etc, ''))
                               from child_table_a b
                              where a.case_id = b.case_id) as my_schema.clob_nested_table) as child_a_data
          from main_query a
    )
    select a.case_id, a.other_fields, a.more_fields, a.transformation1, a.transformation2,
           b.child_a_data
      from main_query a
      left outer join child_a b on a.case_id = b.case_id
    

    Of course, that's pseudo-code, and not 100% syntactically correct, but it shows the magic. There are 54 of these "child" tables, so it's a rather lengthy view, and there's a lot of transformation going on.

    The solution was to use nested tables, and the data shows up in NiFi looking like a properly formatted JSON array, and then gets sent off to Solr, where it looks exactly as it should. Thanks, Alex!
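With the nested table in place, the provenance view for the same ExecuteSQLRecord processor shows a real JSON array rather than an escaped string (illustrative record, column names shortened):

```json
[
  {
    "SOME_ORDINARY_COLUMNS": "Everything looks great!",
    "CHILD_A_DATA": [ "sleepy town", "boring town" ]
  }
]
```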

    UPDATE:

    A colleague ran up against a similar problem, and he ended up solving it with a Jolt transformation. So I tried that out for my flow, and it worked like magic. I had looked at Jolt transformations before, but I will admit I had never touched Jolt, and the learning curve looked steeper than I wanted to take on.

    But basically, once you have your one-to-many data comma-separated in a single column (via listagg or xmlagg), pass it through a JoltTransformJSON processor with the following specification:

    [
      {
        "operation": "shift",
        "spec": { "*": ""}
      },
      {
        "operation": "modify-overwrite-beta",
        "spec": { "*": { "MyFirstCSVColumn": "=split(',',@(1,&))" } }
      },
      {
        "operation": "modify-overwrite-beta",
        "spec": { "*": { "MyNextCSVColumn": "=split(',',@(1,&))" } }
      }
    ]
    

    That first shift operation catches all the "normal" columns in your table. Also, because some of my data had embedded commas, I ended up using the backtick (`) as my delimiter, both in the listagg/xmlagg and in the Jolt specification.
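    To make the spec concrete, here is roughly what it does to a flow file containing a record set (hypothetical column names, comma delimiter). Input:

    ```json
    [
      {
        "NORMAL_COLUMN": "Everything looks great!",
        "MyFirstCSVColumn": "sleepy town,boring town"
      },
      {
        "NORMAL_COLUMN": "Still looking great!",
        "MyFirstCSVColumn": "tiny town"
      }
    ]
    ```

    The outer "*" in each modify-overwrite-beta operation walks the records, and =split turns the delimited string into a proper array, so the output should come out looking something like:

    ```json
    [
      {
        "NORMAL_COLUMN": "Everything looks great!",
        "MyFirstCSVColumn": [ "sleepy town", "boring town" ]
      },
      {
        "NORMAL_COLUMN": "Still looking great!",
        "MyFirstCSVColumn": [ "tiny town" ]
      }
    ]
    ```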