I have some data in Elasticsearch that I need to move to HDFS. I'm trying to use Pig (this is the first time I'm using it), but I'm having trouble defining a correct schema for my data.
First of all, I tried loading the data as JSON using the option 'es.output.json=true'
with org.elasticsearch.hadoop.pig.EsStorage
. I can load/dump the data correctly, and also save it to HDFS as JSON using STORE A INTO 'hdfs://path/to/store';
Later, by defining an external table in Hive, I can query this data. This is the full example that works fine (I removed all SSL attributes from the code):
REGISTER /path/to/commons-httpclient-3.1.jar;
REGISTER /path/to/elasticsearch-hadoop-5.3.0.jar;

-- Load every document of the index as a raw JSON string
A = LOAD 'my-index/log' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes=https://addr1:port,https://addr2:port2,https://addr3:port3',
    'es.query=?q=*',
    'es.output.json=true');

-- Write the JSON documents to HDFS
STORE A INTO 'hdfs://path/to/store';
How can I store my data as Avro to HDFS? I suppose I need to use AvroStorage
, but must I also define a schema when loading the data, or is the JSON enough? I tried to define a schema with a LOAD...USING...AS
statement, setting es.mapping.date.rich=false
instead of es.output.json=true
(my data is quite complex, with maps of maps and things like that), but it doesn't work. I'm not sure whether the problem is in the syntax or in the approach itself. A hint on the correct direction to follow would be nice.
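For concreteness, this is roughly what I imagined the Avro route could look like if the JSON is enough: keep es.output.json=true and wrap each raw document in a single-field Avro record. This is only a sketch (record name, namespace and paths are placeholders, and I haven't verified it against my real mapping):

-- Sketch: store each document's raw JSON string as a one-field Avro record
A = LOAD 'my-index/log' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes=https://addr1:port,https://addr2:port2,https://addr3:port3',
    'es.query=?q=*',
    'es.output.json=true') AS (json:chararray);

STORE A INTO 'hdfs://path/to/store' USING AvroStorage('
{
  "type" : "record",
  "name" : "Document",
  "namespace" : "placeholder",
  "fields" : [ { "name" : "json", "type" : ["null","string"], "default" : null } ]
}
');

This would of course give me Avro files whose payload is still an opaque JSON string, which is why I would prefer a proper schema.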
UPDATE
This is an example of what I tried with es.mapping.date.rich=false
. My problem is that if a field is null, all the fields end up in the wrong order; I suspect this is because fields missing from a document are simply not returned, so the remaining values shift position in the tuple.
A = LOAD 'my-index/log' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes=https://addr1:port,https://addr2:port2,https://addr3:port3',
    'es.query=?q=*',
    'es.mapping.date.rich=false')
    AS (
        field1:chararray,
        field2:chararray,
        field3:map[map[chararray]], -- a map of maps
        field4:chararray,
        field5:map[]
    );
B = FOREACH A GENERATE field1, field2;
STORE B INTO 'hdfs://path/to/store' USING AvroStorage('
{
  "type" : "record",
  "name" : "foo2",
  "namespace" : "foo3",
  "fields" : [ {
    "name" : "field1",
    "type" : ["null","string"],
    "default" : null
  }, {
    "name" : "field2",
    "type" : ["null","string"],
    "default" : null
  } ]
}
');
For future readers: I decided to use Spark
instead, as it is much faster than Pig
. To save Avro
files on HDFS, I'm using the Databricks spark-avro
library.
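A minimal sketch of that approach in Scala, assuming Spark 2.x with the elasticsearch-spark connector and the com.databricks:spark-avro package on the classpath (node addresses and paths are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("es-to-avro").getOrCreate()

// Read the documents through the elasticsearch-spark SQL connector,
// which derives the DataFrame schema from the index mapping
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "https://addr1:port,https://addr2:port2,https://addr3:port3")
  .option("es.query", "?q=*")
  .load("my-index/log")

// Write the result as Avro files via the Databricks spark-avro package
df.write
  .format("com.databricks.spark.avro")
  .save("hdfs://path/to/store")

Since the connector infers the schema from the Elasticsearch mapping, there is no need to declare it by hand.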