Tags: python, python-2.7, scrapy, scrapy-pipeline

Python - Remove tab and new line in Object


I'm a new user of Scrapy and a newbie to Python. The brand and title properties (in Java OOP terms) of my item contain tabs, spaces, and newlines. How can I trim them so that these two properties hold plain string values like the following?

item['brand'] = "KORAL ACTIVEWEAR"
item['title'] = "Boom Leggings"

Below is the data structure:

{'store_id': 870, 'sale_price_low': [], 'brand': [u'\n                KORAL ACTIVEWEAR\n              '], 'currency': 'AUD', 'retail_price': [u'$140.00'], 'category': [u'Activewear'], 'title': [u'\n                Boom Leggings\n              '], 'url': [u'/boom-leggings-koral-activewear/vp/v=1/1524019474.htm?folderID=13331&fm=other-shopbysize-viewall&os=false&colorId=68136'], 'sale_price_high': [], 'image_url': [u'  https://images-na.sample-store.com/images/G/01/samplestore/p/prod/products/kacti/kacti3025868136/kacti3025868136_q1_2-0._SH20_QL90_UY365_.jpg\n'], 'category_link': 'https://www.samplestore.com/clothing-activewear/br/v=1/13331.htm?baseIndex=500', 'store': 'SampleStore'}

I was able to trim the prices down to just the digits and decimal point with a regex search, but I suspect this breaks when a price contains a comma separator.

price = re.compile('[0-9\.]+')
item['retail_price'] = filter(price.search, item['retail_price'])

Solution

  • It looks like all you need to do, at least for this example, is strip all whitespace off the edges of the brand and title values. You don't need a regex for that, just call the strip method.

    However, your brand isn't a single string; it's a list of strings (even if there's only one string in the list). So, if you try to just strip it, or run a regex on it, you're going to get an AttributeError or TypeError from trying to treat that list as a string.

    To fix this, you need to map the strip over all of the strings, with either the map function or a list comprehension:

    item['brand'] = [brand.strip() for brand in item['brand']]
    # unicode.strip, not str.strip: the scraped values are unicode strings
    item['title'] = map(unicode.strip, item['title'])
    

    … whichever of the two is easier for you to understand.
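
    Both of those keep each field as a list. If you want a single plain string per field, as in the question, one option is to take the first non-empty stripped value. The sketch below assumes every field is a list of strings; first_stripped is a hypothetical helper name, not part of Scrapy.

```python
def first_stripped(values, default=None):
    """Return the first non-empty stripped string in values, or default."""
    for value in values:
        text = value.strip()
        if text:
            return text
    return default


# Sample data shaped like the item in the question.
item = {
    'brand': [u'\n                KORAL ACTIVEWEAR\n              '],
    'title': [u'\n                Boom Leggings\n              '],
    'sale_price_low': [],
}

item['brand'] = first_stripped(item['brand'])
item['title'] = first_stripped(item['title'])
item['sale_price_low'] = first_stripped(item['sale_price_low'])  # empty list gives None
```

    Returning a default for an empty list avoids the IndexError you would get from item['brand'][0] on fields like sale_price_low.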


    If you have other examples that have embedded runs of whitespace, and you want to turn every such run into exactly one space character, you need to use the sub method with your regex:

    item['brand'] = [re.sub(ur'\s+', u' ', brand.strip()) for brand in item['brand']]
    

    Notice the u prefixes. In Python 2, you need a u prefix to make a unicode literal instead of a str (encoded bytes) literal. And it's important to use Unicode patterns against Unicode strings, even if the pattern itself doesn't care about any non-ASCII characters. (If all of this seems like a pointless pain and a bug magnet—well, it is; that's the main reason Python 3 exists.)
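
    As a rough sketch of the whitespace-collapsing step, here is the same idea with a precompiled pattern. The ur prefix is a syntax error in Python 3, so this version writes the pattern as u'\\s+' to run on either version; collapse_whitespace is a hypothetical helper name, and the sample brand is made up to show an embedded run.

```python
import re

# Unicode pattern matching any run of whitespace (spaces, tabs, newlines).
WHITESPACE_RUN = re.compile(u'\\s+', re.UNICODE)


def collapse_whitespace(text):
    """Strip the edges, then turn each inner whitespace run into one space."""
    return WHITESPACE_RUN.sub(u' ', text.strip())


# Made-up value with whitespace both around and inside the brand name.
brands = [u'\n      KORAL \t\n   ACTIVEWEAR\n    ']
cleaned = [collapse_whitespace(brand) for brand in brands]
```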


    As for the retail_price, the same basic observations apply. Again, it's a list of strings, not just a string. And again, you probably don't need regex. Assuming the price is always a $ (or other single-character currency marker) followed by a number, just slice off the $ and call float or Decimal on it:

    item['retail_price'] = [float(price[1:]) for price in item['retail_price']]
    

    … but if you have examples that look different, with arbitrary extra characters on both sides of the price, you can use re.search here. You'll still need to map it over the list, and to use a Unicode pattern.

    You also need to grab the matching group out of the search, and to handle empty/invalid strings in some way (they'll return None for the search, and you can't convert that to a float). You have to decide what to do about it, but from your attempt with filter it looks like you just want to skip them. This is complicated enough that I'd do it in multiple steps:

    prices = item['retail_price']
    matches = (re.search(ur'[0-9.]+', price) for price in prices)
    groups = (match.group() for match in matches if match)
    item['retail_price'] = map(float, groups)
    

    … or maybe wrap that in a function.
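
    Wrapped in a function, those steps might look like the sketch below. It goes beyond the code above in two ways, both assumptions on my part: it treats a comma as a thousands separator (the case the question worries about) and removes it before converting, and it returns Decimal rather than float. parse_prices and PRICE_PATTERN are hypothetical names, and a format like 1.299,50 would be misread.

```python
import re
from decimal import Decimal

# Digits with optional comma separators and an optional decimal part.
# Assumption: commas separate thousands, e.g. "1,299.50".
PRICE_PATTERN = re.compile(u'[0-9][0-9,]*(?:\\.[0-9]+)?')


def parse_prices(raw_prices):
    """Return a Decimal for each string that contains a price; skip the rest."""
    prices = []
    for raw in raw_prices:
        match = PRICE_PATTERN.search(raw)
        if match:  # empty or invalid strings produce no match and are skipped
            prices.append(Decimal(match.group().replace(u',', u'')))
    return prices


# Made-up inputs: one from the question, plus some edge cases.
parsed = parse_prices([u'$140.00', u'', u'AUD 1,299.50 inc. GST', u'n/a'])
```

    In the spider or pipeline you would then write item['retail_price'] = parse_prices(item['retail_price']).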