Search code examples
htmljsonapache-drill

How do I query HTML embedded inside a string inside a JSON file with Apache Drill?


I'm trying to use Apache Drill (for the first time) on a JSON file that looks like this:

{
    "Key1": {
      "htmltags": "<htmltag attr1='bravo' /><htmltag attr2='delta' /><htmltag attr3='charlie' />"
    },
    "Key2": {
      "htmltags": "<htmltag attr1='kilo' /><htmltag attr2='lima' /><htmltag attr3='mike' />"
    },
    "Key3": {
      "htmltags": "<htmltag attr1='november' /><htmltag attr2='foxtrot' /><htmltag attr3='sierra' />"
    }
}

My initial query was the hello world of drill: SELECT * FROM DataFile.json, and returned me the columns Key1, Key2, Key3. They only had one row, and it contained the entry: "<htmltag attr1='bravo' /><htmltag attr2='delta' /><htmltag attr3='charlie' />" [i.e., only the entry Key1.htmltags].

I have two questions:

  1. Why was there only one row returned, when there were three differently valued entries for each key?
  2. After using the KVGEN/FLATTEN functions to get at my strings inside "htmltags" above, is there a way to drill further into (analyse and extract data from) the HTML tags?

Solution

  • Unfortunately, it looks like Drill isn't the right tool (v1.1.0 as of this writing on Homebrew) for the job.

    1. It looks like there is a bug with the system which is the reason why there is only one row despite multiple columns. I've filed a report: https://issues.apache.org/jira/browse/DRILL-4102
    2. I've scoured the documentation once again, there are no tools to analyse HTML or XML natively. Depending on string manipulation for this is not a task I relish.

    Hence, I'll go with an XML parser, DOM tree crawler or the like, and use a bash string function to extract the target tag strings awk/tee.