We are trying to index sample XML files into Cloudera Solr using the Flume MorphlineSolrSink.
We have created 2 channels (solrchannel, hdfschannel) and 2 sinks (solrsink, hdfssink). We are able to index the documents in Cloudera Solr using this Flume and morphline configuration.
Question 1) We have 2 fields, title and content, in the XML files, and we want to strip the HTML markup from both fields before sending them to Solr. Could you please tell us how we can achieve this?
Question 2) I have to change the date format of 2 fields, createDate and publishedDate. Could you please let me know how to write the logic to change the date format of both fields in one go?
I am using XQuery to extract the dates from my XML files.
morphline.conf https://gist.github.com/jsbonline2006/e04433f9b11cdcafa865#file-morphline-conf
I found the following solution to my problem and wanted to share it with you:
2) After the XQuery command block, I added the following commands to convert the dates into the required format, and it worked perfectly fine.
{
  convertTimestamp {
    field : createDate
    inputFormats : ["E MMM dd HH:mm:ss z yyyy", "yyyy-MM-dd"]
    inputTimezone : UTC
    outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
    outputTimezone : America/Los_Angeles
  }
}
{
  convertTimestamp {
    field : publishedDate
    inputFormats : ["E MMM dd HH:mm:ss z yyyy", "yyyy-MM-dd"]
    inputTimezone : UTC
    outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
    outputTimezone : America/Los_Angeles
  }
}
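To check what these settings do to a given value before running the whole pipeline, the same conversion can be approximated in plain Java. This is only a rough sketch, assuming the morphline patterns follow java.text.SimpleDateFormat conventions; the class name TimestampConverter and the sample values are made up for illustration and are not part of the morphline configuration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper that mimics the convertTimestamp settings shown above.
public class TimestampConverter {
    // Same patterns as in the morphline snippet.
    static final String[] INPUT_FORMATS = {"E MMM dd HH:mm:ss z yyyy", "yyyy-MM-dd"};
    static final String OUTPUT_FORMAT = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'";

    public static String convert(String value, String inputZone, String outputZone) {
        // Try each input pattern in order and use the first one that parses.
        for (String pattern : INPUT_FORMATS) {
            SimpleDateFormat in = new SimpleDateFormat(pattern, Locale.ENGLISH);
            in.setTimeZone(TimeZone.getTimeZone(inputZone));
            Date parsed;
            try {
                parsed = in.parse(value);
            } catch (ParseException e) {
                continue; // pattern did not match, try the next one
            }
            SimpleDateFormat out = new SimpleDateFormat(OUTPUT_FORMAT, Locale.ENGLISH);
            out.setTimeZone(TimeZone.getTimeZone(outputZone));
            return out.format(parsed);
        }
        throw new IllegalArgumentException("Unparseable date: " + value);
    }

    public static void main(String[] args) {
        // prints 2015-03-03T02:15:30.000Z
        System.out.println(convert("Tue Mar 03 10:15:30 UTC 2015", "UTC", "America/Los_Angeles"));
        // prints 2015-01-14T16:00:00.000Z
        System.out.println(convert("2015-01-15", "UTC", "America/Los_Angeles"));
    }
}
```

Note that the 'Z' in the output pattern is a quoted literal, so it is appended unchanged even though the value has been shifted to America/Los_Angeles. Solr reads a trailing Z as UTC, so if the indexed dates look offset by several hours, setting outputTimezone to UTC may be worth trying.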
1) To strip the HTML tags from title and content, we wrote Java code and plugged it into our pipeline before the file content is sent to Flume.
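I cannot post our exact code, but a minimal sketch of that kind of pre-processing step could look like the following. The class name HtmlStripper is hypothetical, and the regex approach is an assumption chosen to keep the example dependency-free; it only handles simple markup and a handful of common entities, so a real HTML parser such as jsoup would be more robust for messy input.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes HTML tags from a field value before it is handed to Flume.
public class HtmlStripper {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String stripHtml(String input) {
        if (input == null) {
            return "";
        }
        // Replace each tag with a space so words on either side of a tag do not run together.
        String noTags = TAG.matcher(input).replaceAll(" ");
        // Decode a few common entities; &amp; goes last to avoid double decoding.
        String decoded = noTags
                .replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        // Collapse runs of whitespace left behind by the removed tags.
        return WHITESPACE.matcher(decoded).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // prints Hello Solr & Flume
        System.out.println(stripHtml("<p>Hello <b>Solr</b> &amp; Flume</p>"));
    }
}
```

The same method would be applied to both the title and the content values before the XML is written out for Flume to pick up.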
Hope this helps you as well!