
Why won't my connection to Solr work?


I am crawling websites using Scrapy and then sending the data to Solr to be indexed. The data is sent through an item pipeline that uses mysolr, one of the Python clients for Solr.

The spider works correctly, and my items array has two items with the correct fields. These items are handed to the process_item function in the pipeline.

Item Pipeline

from mysolr import Solr

class SolrPipeline(object):
    def __init__(self):
        self.client = Solr('http://localhost:8983/solr', version=4)
        response = self.client.search(q='Title')
        print response

    def process_item(self, item, spider):
        docs = [
            {'title' : item["title"],
             'subtitle' : item["subtitle"]   
            },
            {'title': item["title"],
             'subtitle': item["subtitle"]
            }
        ]
        print docs
        self.client.update(docs, 'json', commit=False)
        self.client.commit()

This is where the problem appears: the response that gets printed is <SolrResponse status=404>. I used the SOLR_URL that appears whenever I launch the Solr Admin UI.

The log also shows the following 404 responses:

2015-08-25 09:06:53 [urllib3.connectionpool] INFO: Starting new HTTP connection (1): localhost
2015-08-25 09:06:53 [urllib3.connectionpool] DEBUG: Setting read timeout to None
2015-08-25 09:06:53 [urllib3.connectionpool] DEBUG: "POST /update/json HTTP/1.1" 404 1278
2015-08-25 09:06:53 [urllib3.connectionpool] INFO: Starting new HTTP connection (1): localhost
2015-08-25 09:06:53 [urllib3.connectionpool] DEBUG: Setting read timeout to None
2015-08-25 09:06:53 [urllib3.connectionpool] DEBUG: "POST /update HTTP/1.1" 404 1273

These six lines appear twice, presumably once for each item I am trying to add.


Solution

  • You want to send a POST request with JSON data, but you are in fact passing a Python list of dictionaries to the self.client.update() method.

    Convert the Python list of dictionaries to JSON:

    import json
    from mysolr import Solr
    
    class SolrPipeline(object):
        def __init__(self):
            self.client = Solr('http://localhost:8983/solr', version=4)
            # Quick connectivity check when the pipeline is created.
            response = self.client.search(q='Title')
            print(response)

        def process_item(self, item, spider):
            docs = [
                {'title': item["title"],
                 'subtitle': item["subtitle"]
                },
                {'title': item["title"],
                 'subtitle': item["subtitle"]
                }
            ]

            docs = json.dumps(docs)  # convert to JSON
            self.client.update(docs, 'json', commit=False)
            self.client.commit()
            return item  # Scrapy expects process_item to return the item
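
Two things can be checked offline before pointing the pipeline at Solr. This is a sketch under stated assumptions, not a confirmed fix. First, it shows what json.dumps produces for one scraped item: a plain string holding a JSON array. Second, the logged requests go to /update/json and /update, without the /solr prefix from the configured URL. Assuming the client resolves endpoints against its base URL the way urljoin does (an assumption about mysolr, not verified here), a base URL without a trailing slash would produce exactly those paths. The helper to_json_payload and the sample item values are hypothetical.

```python
import json

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, which the pipeline code targets


def to_json_payload(item):
    """Serialize one scraped item as a JSON array holding a single document."""
    doc = {'title': item['title'], 'subtitle': item['subtitle']}
    return json.dumps([doc])


# Hypothetical item values, for illustration only.
sample_item = {'title': 'Example title', 'subtitle': 'Example subtitle'}
payload = to_json_payload(sample_item)
print(type(payload).__name__)  # the payload is a string, not a list

# urljoin drops the last path segment when the base has no trailing slash.
without_slash = urljoin('http://localhost:8983/solr', 'update/json')
with_slash = urljoin('http://localhost:8983/solr/', 'update/json')
print(without_slash)  # http://localhost:8983/update/json
print(with_slash)     # http://localhost:8983/solr/update/json
```

If the second check matches what the client does, adding a trailing slash to the URL passed to Solr(...) may be worth trying alongside the JSON conversion. The sketch sends one document per item because Scrapy calls process_item once for each item.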