
Spring Cloud Dataflow - http | kafka and kafka | hdfs - Getting Raw message in HDFS


I am building a basic pipeline in SCDF (Local Server 1.7.3) made up of two streams:

  1. HTTP -> Kafka topic
  2. Kafka topic -> HDFS

Streams:

stream create --name ingest_from_http --definition "http --port=8000 --path-pattern=/test > :streamtest1"
stream deploy --name ingest_from_http --properties "app.http.spring.cloud.stream.bindings.output.producer.headerMode=raw"

stream create --name ingest_to_hdfs --definition ":streamtest1 > hdfs --fs-uri=hdfs://<host>:8020 --directory=/tmp/hive/sensedev/streamdemo/ --file-extension=xml --spring.cloud.stream.bindings.input.consumer.headerMode=raw" 

I have created a Hive managed table on location /tmp/hive/sensedev/streamdemo/

DROP TABLE IF EXISTS gwdemo.xml_test;
CREATE TABLE gwdemo.xml_test(
  id int,
  name string
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.id"="/body/id/text()",
  "column.xpath.name"="/body/name/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/tmp/hive/sensedev/streamdemo'
TBLPROPERTIES (
  "xmlinput.start"="<body>",
  "xmlinput.end"="</body>"
);

Testing:

  1. Whether Hive is able to read XML: I put an XML file in the location /tmp/hive/sensedev/streamdemo.

File Content: <body><id>1</id><name>Test1</name></body>

Running a SELECT on the table returned the above record correctly.

  2. When I post a record to SCDF with http post, the Kafka consumer shows the correct data. The XML files are created in HDFS, but they contain a raw byte-array reference instead of the XML message. Example:

    dataflow>http post --target http:///test --data "<body><id>2</id><name>Test2</name></body>" --contentType application/xml

In Kafka Console Consumer, I am able to read proper XML message: <body><id>2</id><name>Test2</name></body>

 $ hdfs dfs -cat /tmp/hive/sensedev/streamdemo/hdfs-sink-2.xml
[B@31d94539

Question: What am I missing? How can I get proper XML records into the newly created XML files in HDFS?


Solution

  • HDFS Sink expects a Java Serialized object.
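
For context on the symptom: `[B@31d94539` is the text Java produces when `toString()` is called on a `byte[]`, so the file holds the array's identity string rather than its contents. The sketch below is not the sink's actual code. It only illustrates, under the assumption that the payload reaches the sink as raw bytes, how that output arises and how decoding the same bytes recovers the XML. The class and method names (`RawPayloadDemo`, `viaToString`, `viaDecode`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class RawPayloadDemo {

    // What a writer emits if it calls toString() on a byte[] payload:
    // the JVM's default "[B@<hash>" array identity, not the contents.
    static String viaToString(byte[] payload) {
        return payload.toString();
    }

    // Decoding the bytes with an explicit charset recovers the original text.
    static String viaDecode(byte[] payload) {
        return new String(payload, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Same XML body as in the http post example from the question.
        byte[] payload = "<body><id>2</id><name>Test2</name></body>"
                .getBytes(StandardCharsets.UTF_8);

        System.out.println(viaToString(payload)); // "[B@" followed by a hash that varies per run
        System.out.println(viaDecode(payload));   // the original XML string
    }
}
```

If this is what is happening, the message has to be converted to a String (or to whatever type the sink expects) before it is written; which binding property achieves that depends on the app starter version and is not shown here.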