Strip \n \t \r in scrapy


I'm trying to strip \r, \n and \t characters in a Scrapy spider and then write the result to a JSON file.

I have a "description" field that is full of newlines, so the output isn't what I want, which is each description matched to its title.

I tried map() with unicode.strip, but it doesn't really work. Being new to Scrapy, I don't know whether there's a simpler way, or how map() with unicode.strip really works.

This is my code:

def parse(self, response):
    for sel in response.xpath('//div[@class="d-grid-main"]'):
        item = xItem()
        item['TITLE'] = sel.xpath('xpath').extract()
        item['DESCRIPTION'] = map(unicode.strip, sel.xpath('//p[@class="class-name"]/text()').extract())

I also tried:

item['DESCRIPTION'] = str(sel.xpath('//p[@class="class-name"]/text()').extract()).strip()

But it raised an error. What's the best way?


Solution

  • unicode.strip only removes whitespace characters at the beginning and end of a string. From the Python documentation:

    Return a copy of the string with the leading and trailing characters removed.

    It does not touch \n, \r, or \t in the middle of the string.

    You can either write a custom function that removes those characters inside the string (using the regular expression module, re), or use XPath's normalize-space() function, which

    returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space.

    Example Python shell session:

    >>> text='''<html>
    ... <body>
    ... <div class="d-grid-main">
    ... <p class="class-name">
    ... 
    ...  This is some text,
    ...  with some newlines \r
    ...  and some \t tabs \t too;
    ... 
    ... <a href="http://example.com"> and a link too
    ...  </a>
    ... 
    ... I think we're done here
    ... 
    ... </p>
    ... </div>
    ... </body>
    ... </html>'''
    >>> import scrapy
    >>> response = scrapy.Selector(text=text)
    >>> response.xpath('//div[@class="d-grid-main"]')
    [<Selector xpath='//div[@class="d-grid-main"]' data=u'<div class="d-grid-main">\n<p class="clas'>]
    >>> div = response.xpath('//div[@class="d-grid-main"]')[0]
    >>> 
    >>> # you'll want to use relative XPath expressions, starting with "./" or ".//"
    >>> div.xpath('.//p[@class="class-name"]/text()').extract()
    [u'\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n',
     u"\n\nI think we're done here\n\n"]
    >>> 
    >>> # only leading and trailing whitespace is removed by strip()
    >>> map(unicode.strip, div.xpath('.//p[@class="class-name"]/text()').extract())
    [u'This is some text,\n with some newlines \r\n and some \t tabs \t too;', u"I think we're done here"]
    >>> 
    >>> # normalize-space() returns a single string for the whole element, link text included
    >>> div.xpath('normalize-space(.//p[@class="class-name"])').extract()
    [u"This is some text, with some newlines and some tabs too; and a link too I think we're done here"]
    >>>
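
The session above only demonstrates normalize-space(). The other option mentioned, a custom function built on the regular expression module, could look roughly like the sketch below. The name clean_whitespace and the hard-coded fragments list are assumptions made for illustration (the fragments mimic the text nodes that extract() returned above), and the sketch uses plain str so that it also runs on Python 3, where str takes the place of unicode.

```python
import re

# Matches any run of whitespace, including \n, \r and \t.
_WHITESPACE_RUN = re.compile(r"\s+")


def clean_whitespace(text):
    """Collapse every run of whitespace to one space and trim both ends.

    Hypothetical helper, not part of Scrapy.
    """
    return _WHITESPACE_RUN.sub(" ", text).strip()


# Stand-ins for the text nodes returned by extract() in the session above.
fragments = [
    "\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n",
    "\n\nI think we're done here\n\n",
]

# Clean each text node, then join them into one description string.
cleaned = [clean_whitespace(fragment) for fragment in fragments]
description = " ".join(cleaned)
print(description)
```

Unlike normalize-space() applied to the whole element, this only sees the text nodes you pass in, so the link text is not part of the result unless you also extract it.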