How to strip HTML content in a Flume morphline.conf file using XQuery


We are trying to index sample XML files into Cloudera Solr using the Flume MorphlineSolrSink.

We have created two channels (solrchannel, hdfschannel) and two sinks (solrsink, hdfssink). With this Flume and morphline configuration we can already index documents in Cloudera Solr.

Question 1: The XML files contain two fields, title and content, and we want to strip the HTML markup from both fields before sending them to Solr. How can we achieve this?

Question 2: I have to change the date format of two fields, createDate and PublishedDate. How can I write the logic to change the date format of both fields in one go?

I am using XQuery to extract the date from my XML files.


morphline.conf https://gist.github.com/jsbonline2006/e04433f9b11cdcafa865#file-morphline-conf



Solution

  • I found the following solution to my problem and wanted to share it with you:

    2) After the XQuery command block, I added the following convertTimestamp commands to convert the dates into the required format, and it worked perfectly.

      {
        # Reformat createDate.
        convertTimestamp {
          field : createDate
          # Accepts a java.util.Date.toString()-style value or a plain yyyy-MM-dd date
          inputFormats : ["E MMM dd HH:mm:ss z yyyy", "yyyy-MM-dd"]
          inputTimezone : UTC
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
          outputTimezone : America/Los_Angeles
        }
      }
    
      {
        # Same conversion for publishedDate; each convertTimestamp command takes a single field.
        convertTimestamp {
          field : publishedDate
          inputFormats : ["E MMM dd HH:mm:ss z yyyy", "yyyy-MM-dd"]
          inputTimezone : UTC
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
          outputTimezone : America/Los_Angeles
        }
      }
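
If you want to check the formats outside Flume, here is a rough Java sketch of the same conversion. It assumes that convertTimestamp follows java.text.SimpleDateFormat pattern rules, which is how the Kite Morphlines documentation describes its formats. The class name TimestampSketch and the method convert are hypothetical names used only for illustration; this is not the morphline implementation.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper that imitates the convertTimestamp settings shown above.
public class TimestampSketch {

    // Tries each input format in order and formats the first successful parse.
    static String convert(String value, String[] inputFormats, String inputTz,
                          String outputFormat, String outputTz) {
        for (String format : inputFormats) {
            SimpleDateFormat in = new SimpleDateFormat(format, Locale.ENGLISH);
            in.setTimeZone(TimeZone.getTimeZone(inputTz));
            in.setLenient(false);
            try {
                Date parsed = in.parse(value);
                SimpleDateFormat out = new SimpleDateFormat(outputFormat, Locale.ENGLISH);
                out.setTimeZone(TimeZone.getTimeZone(outputTz));
                return out.format(parsed);
            } catch (ParseException e) {
                // This format did not match; fall through to the next one.
            }
        }
        throw new IllegalArgumentException("No input format matched: " + value);
    }

    public static void main(String[] args) {
        String[] inputFormats = {"E MMM dd HH:mm:ss z yyyy", "yyyy-MM-dd"};
        String outputFormat = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'";

        // Same settings as the morphline commands above.
        System.out.println(convert("2015-01-15", inputFormats, "UTC",
                outputFormat, "America/Los_Angeles")); // prints 2015-01-14T16:00:00.000Z

        // Same input, but formatted in UTC.
        System.out.println(convert("2015-01-15", inputFormats, "UTC",
                outputFormat, "UTC"));                 // prints 2015-01-15T00:00:00.000Z
    }
}
```

One thing the sketch makes visible: the 'Z' in outputFormat is a quoted literal, so with outputTimezone set to America/Los_Angeles the value holds Pacific local time but still ends in Z. If your Solr date field expects UTC, setting outputTimezone to UTC may be closer to what you want.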
    

    1) To strip the HTML tags from title and content, we wrote Java code and plugged it into our pipeline so that it runs before the file content is sent to Flume.

    Hope this helps you as well!
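
The Java code itself is not included in this answer. As a minimal sketch of what such a pre-processing step could look like, assuming simple regex-based tag removal is good enough for your title and content values: the class name HtmlStripper and the method stripHtml are hypothetical, and only a handful of common entities are decoded. For messy real-world HTML a proper HTML parser would be a safer choice.

```java
import java.util.regex.Pattern;

// Hypothetical pre-processing helper; not the code used in the original pipeline.
public class HtmlStripper {

    // Whole <script>...</script> and <style>...</style> elements, including their bodies.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Any remaining tag such as <p>, </b> or <br/>.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String stripHtml(String html) {
        if (html == null) {
            return "";
        }
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a few common entities; &amp; goes last to avoid double decoding.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&#39;", "'")
                   .replace("&amp;", "&");
        // Collapse the whitespace left behind by the removed tags.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // prints: Hello World & friends
        System.out.println(stripHtml("<p>Hello <b>World</b> &amp; friends</p>"));
    }
}
```

Each tag is replaced with a space rather than an empty string so that words separated only by markup (for example `one<br>two`) do not run together; the trade-off is that a tag in the middle of a word splits it.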

    Regards,

    Jayesh Bhoyar

    http://technical-fundas.blogspot.in/p/technical-profile.html