Tags: json, solr, ibm-cloud, nutch, retrieve-and-rank

Indexing nutch crawled data in "Bluemix" solr


I'm trying to index data crawled with Nutch into Bluemix Solr, and I cannot find any way to do it. My main question is: how can I send the results of my Nutch crawl to my Bluemix Solr collection? I crawled with Nutch 1.11. Below is what I have tried so far and the problems I ran into. I considered two possible approaches:

  1. By nutch command:

NUTCH_PATH/bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/ -Dsolr.server.url="OURSOLRURL"

This command should index the Nutch crawl data into our Solr instance, but I ran into some problems with it.

a. Odd as it sounds, the command would not accept the URL as written. I worked around this by passing the URL-encoded form instead.

b. Because my Solr instance requires a specific username and password, Nutch could not connect to it. Considering this list of options:

 Active IndexWriters :
 SolrIndexWriter
    solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
    solr.server.url : URL of the Solr instance (mandatory)
    solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
    solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.commit.size : buffer size when sending to Solr (default 1000)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication

which appears in the command-line output, I tried to solve this by adding the authentication parameters solr.auth=true solr.auth.username="SOLR-UserName" solr.auth.password="Pass" to the command.

So far, I have arrived at this command:

bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016* solr.server.url="https%3A%2F%2Fgateway.watsonplatform.net%2Fretrieve-and-rank%2Fapi%2Fv1%2Fsolr_clusters%2FCLUSTER-ID%2Fsolr%2Fadmin%2Fcollections" solr.auth=true solr.auth.username="USERNAME" solr.auth.password="PASS"

But for some reason that I have not been able to figure out, the command treats the authentication parameters as crawl data directories and does not work. So I guess this is not the right way to set the "Active IndexWriters" options. Can anyone tell me how to do it?

  2. By curl command:

curl -X POST -H "Content-Type: application/json" -u "BLUEMIXSOLR-USERNAME":"BLUEMIXSOLR-PASS" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update" --data-binary @{/path_to_file}/FILE.json

I thought I could feed Solr the JSON files created by this command:

bin/nutch commoncrawldump -outputDir finalcrawlResult/ -segment crawl/segments -gzip -extension json -SimpleDateFormat -epochFilename -jsonArray -reverseKey

but there are some problems here.

a. This command produces a large number of files in deeply nested paths, so posting them all manually would take a very long time. For big crawls it may even be impossible. Is there any way to POST all the files in a directory and its subdirectories at once with a single command?

b. There is a strange string, "ÙÙ÷yœ", at the start of the JSON files created by commoncrawldump.

c. I removed the strange string and tried to POST just one of these files, but here is the result:

 {"responseHeader":{"status":400,"QTime":23},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"Unknown command 'url' at [9]","code":400}}

Does this mean these files cannot be fed to Bluemix Solr, and that they are useless to me?


Solution

  • To index Nutch crawl data in the Bluemix Retrieve and Rank service:

    1. Crawl the seeds with Nutch, e.g.:

      $:bin/crawl -w 5 urls crawl 25

    You can check the status of the crawl with:

    bin/nutch readdb crawl/crawldb/ -stats

    2. Dump the crawled data as files:

      $:bin/nutch dump -flatdir -outputDir dumpData/ -segment crawl/segments/

    3. Post the files that can be indexed directly, e.g. XML files, to the Solr collection on Retrieve and Rank:

      import subprocess

      # solr_cluster_id, solr_collection_name, Cont_type_xml, solr_credentials and myfilename are assumed to be defined earlier
      Post_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update"' %(solr_cluster_id, solr_collection_name)
      cmd ='''curl -X POST -H %s -u %s %s --data-binary @%s''' %(Cont_type_xml, solr_credentials, Post_url, myfilename)
      subprocess.call(cmd,shell=True)

    4. Convert the rest to JSON with the Bluemix Document Conversion service:

      doc_conv_url = '"https://gateway.watsonplatform.net/document-conversion/api/v1/convert_document?version=2015-12-15"'
      cmd ='''curl -X POST -u %s -F config="{\\"conversion_target\\":\\"answer_units\\"}" -F file=@%s %s''' %(doc_conv_credentials, myfilename, doc_conv_url)
      process = subprocess.Popen(cmd, shell= True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
      

    Then save the JSON results in a JSON file.

    5. Post this JSON file to the collection:

      Post_converted_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update/json/docs?commit=true&split=/answer_units/id&f=id:/answer_units/id&f=title:/answer_units/title&f=body:/answer_units/content/text"' %(solr_cluster_id, solr_collection_name)
      cmd ='''curl -X POST -H %s -u %s %s --data-binary @%s''' %(Cont_type_json, solr_credentials, Post_converted_url, Path_jsonFile)
      subprocess.call(cmd,shell=True)
      
    6. Send queries:

      pysolr_client = retrieve_and_rank.get_pysolr_client(solr_cluster_id, solr_collection_name)
      results = pysolr_client.search(Query_term)
      print(results.docs)
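
The Document Conversion step above starts the curl process but does not show how its output ends up in a JSON file. A minimal sketch of that missing piece follows, assuming the service prints a single JSON document on stdout. The helper name save_converter_output is hypothetical and is not part of any Bluemix library.

```python
import json


def save_converter_output(raw_stdout, out_path):
    """Parse curl's stdout (bytes) as JSON and write it to out_path.

    Assumption: the converter returned one valid JSON document.
    json.loads raises ValueError if it returned something else,
    for example an HTML error page.
    """
    data = json.loads(raw_stdout.decode("utf-8"))
    with open(out_path, "w") as handle:
        json.dump(data, handle)
    return data


# Usage with the `process` object created in the conversion step:
#   stdout, stderr = process.communicate()   # waits for curl to finish
#   save_converter_output(stdout, Path_jsonFile)
```

Parsing before saving means a failed conversion is noticed here rather than later, when Solr rejects the file.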
      

    The code is in Python. For beginners: you can run the curl commands directly on your command line. I hope it helps.
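
For a dump with many files, the posting steps can be run in a loop instead of by hand. The sketch below walks a dump directory, including subdirectories, and builds one curl argument list per file. The function names are hypothetical, and the content type "application/xml" in the usage note is an assumption about what Cont_type_xml holds in the code above.

```python
import os


def iter_dump_files(root_dir):
    """Yield the path of every file under root_dir, including subdirectories."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def build_post_command(path, url, credentials, content_type):
    """Return a curl argument list that POSTs one file to a Solr update URL.

    credentials is a "USERNAME:PASSWORD" string, as passed to curl -u.
    A list (rather than one shell string) lets subprocess.call run it
    without shell=True, so no extra quoting of the URL is needed.
    """
    return [
        "curl", "-X", "POST",
        "-H", "Content-Type: " + content_type,
        "-u", credentials,
        url,
        "--data-binary", "@" + path,
    ]


# Usage (not executed here):
#   import subprocess
#   for path in iter_dump_files("dumpData/"):
#       if path.endswith(".xml"):
#           subprocess.call(build_post_command(
#               path, post_url, "USERNAME:PASS", "application/xml"))
```

Files with other extensions would go through the Document Conversion step first, as described in the list above.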