python python-2.7 beautifulsoup python-unicode

How to find string and return it to stdout in Python


I am getting familiar with Python and am struggling to do the following with BeautifulSoup.

What is expected:

If the output of the script below contains the string 5378, it should email me the line in which the string appears.

#! /usr/bin/env python

from bs4 import BeautifulSoup
from lxml import html
import urllib2,re

import codecs
import sys
streamWriter = codecs.lookup('utf-8')[-1]
sys.stdout = streamWriter(sys.stdout)

BASE_URL = "http://outlet.us.dell.com/ARBOnlineSales/Online/InventorySearch.aspx?c=us&cs=22&l=en&s=dfh&brandid=2201&fid=111162"

webpage = urllib2.urlopen(BASE_URL)
soup = BeautifulSoup(webpage.read(), "lxml")
findcolumn = soup.find("div", {"id": "itemheader-FN"})
name = findcolumn.text.strip()
print name

I tried using findall(5378, name), but it returns empty brackets, like this: [].

  • I am also running into Unicode issues when I pipe the output to grep.

$ python dell.py | grep 5378
Traceback (most recent call last):
  File "dell.py", line 18, in <module>
    print name
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 817: ordinal not in range(128)

Can someone tell me what I am doing wrong in both cases?


Solution

  • The function findall (from the re module) expects its first parameter to be a regular expression, which is a string, but you provided an integer. Try this instead:

    re.findall("5378", name)
    

    When printed, this outputs a list such as [u'5378'] (one entry per match) if the string was found, or [] if it was not.
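
To make the difference concrete, here is a minimal sketch of the string-pattern call. The sample text is made up for illustration (the real page content will differ), and the snippet is written so that it also runs on Python 3.

```python
import re

# Made-up sample text; the scraped page content will differ.
text = "Item 5378, Item 7000, Item 5378"

number = 5378  # an int, as in the question

# Convert the number to a string before using it as a pattern.
found = re.findall(str(number), text)

# A pattern that does not occur yields an empty list.
missing = re.findall("9999", text)

print(found)
print(missing)
```

Here found holds one entry per occurrence and missing is an empty list, so a plain `if found:` is enough to test for a match.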

    I suspect you want to retrieve the product name that goes with the number, which means you have to iterate through the elements inside findcolumn. We can use re.search() here to check for a match within each element's text.

    for input_element in findcolumn.find_all("div"):
        # .text is already a unicode string in BeautifulSoup
        name = input_element.text.strip()
        if re.search("5378", name) is not None:
            print name
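
If the goal is to collect the matching lines (for example, to put them in an email body later), the same idea can be sketched without BeautifulSoup. The helper name lines_containing and the sample text are hypothetical, chosen only for illustration; the code is written to run on Python 3 as well as 2.7.

```python
from __future__ import print_function, unicode_literals

import re


def lines_containing(text, needle):
    """Return every line of text that contains needle as a literal substring."""
    # re.escape() keeps characters such as "." from being read as regex syntax.
    pattern = re.compile(re.escape(needle))
    return [line.strip() for line in text.splitlines() if pattern.search(line)]


# Made-up stand-in for the scraped text; the real page content will differ.
sample_text = (
    "Item 5378 with a 13.3\u201d display\n"
    "Item 7000 with a 15\u201d display\n"
)

matches = lines_containing(sample_text, "5378")
```

Returning a list rather than printing inside the loop keeps the search separate from whatever is done with the result, whether that is printing or sending a notification.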
    

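    As for the Unicode error: when the output is piped (for example to grep), Python 2 cannot detect an encoding for stdout and falls back to ASCII, which cannot represent u'\u201d'. There are several solutions, depending on your operating system and configuration: reconfigure your system locale (for example on Ubuntu), or encode your script's output explicitly with .encode()/unicode().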
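
The explicit-encoding option can be sketched as follows. The helper names write_utf8 and binary_stdout and the sample string are assumptions made for this example, not part of the original script; the snippet is written to run on Python 3 as well as 2.7.

```python
from __future__ import print_function, unicode_literals

import io
import sys


def write_utf8(text, stream):
    """Encode text as UTF-8 and write it, plus a newline, to a byte stream."""
    stream.write(text.encode("utf-8") + b"\n")


def binary_stdout():
    """Return a stdout object that accepts bytes.

    On Python 3 this is sys.stdout.buffer; on Python 2, sys.stdout itself.
    """
    return getattr(sys.stdout, "buffer", sys.stdout)


# Made-up sample containing the character from the traceback (U+201D).
sample = "Item 5378 with a 13.3\u201d display"

# Demonstrate on an in-memory byte stream so the result can be inspected.
buf = io.BytesIO()
write_utf8(sample, buf)
encoded = buf.getvalue()

if __name__ == "__main__":
    # Writing bytes means no implicit ASCII encoding when piped to grep.
    write_utf8(sample, binary_stdout())
```

Because the text is turned into bytes before it reaches stdout, the result does not depend on the locale or on whether the output goes to a terminal or a pipe.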