Tags: python, pandas, dataframe, beautifulsoup, stringio

Converting scraped text into Pandas data frame with BeautifulSoup


I am extracting some text from a website using the code below. I have it in the form of a string.

import requests
URL = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
r = requests.get(URL)
page = r.text

from bs4 import BeautifulSoup
soup = BeautifulSoup(page, 'lxml')
import re

strong_el = soup.find('strong',text='WHAT RESPONDENTS ARE SAYING …')

ul_tag = strong_el.find_next_sibling('ul')
LI_TAG =''
for li_tag in ul_tag.children:

    LI_TAG += li_tag.string

print(LI_TAG)

I am trying to create a data frame with two columns: 1) Comments and 2) Industry (the substring within the brackets). When I tried to use StringIO as follows, I got the error 'TypeError: data argument can't be an iterator'. How can I convert these comments into a data frame?

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

LI_TAG = StringIO(LI_TAG)
df = pd.DataFrame(LI_TAG)

Solution

  • The LI_TAG variable is just one long string, and pd.DataFrame cannot parse a StringIO object (hence the TypeError), so you have to split the string into comments and industries yourself before storing them in a data frame.

    import requests
    URL = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
    r = requests.get(URL)
    page = r.text
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(page, 'lxml')
    import re
    
    strong_el = soup.find('strong',text='WHAT RESPONDENTS ARE SAYING …')
    
    ul_tag = strong_el.find_next_sibling('ul')
    LI_TAG =''
    for li_tag in ul_tag.children:
    
        LI_TAG += li_tag.string
    
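    # Each line is expected to look like: \u201ccomment\u201d (Industry).
    # Split on the last closing quotation mark \u201d, then strip the opening
    # mark \u201c from the comment and the parentheses from the industry.
    comments=[]
    industries=[]
    for string in LI_TAG.strip().split('\n'):
        if not string.strip():
            continue  # skip blank lines between items
        comment, industry = string.rsplit(u'\u201d', 1)
        comments.append(comment.strip().strip(u'\u201c'))
        industries.append(industry.strip(' ()'))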
    
    import pandas as pd
    
    data = pd.DataFrame()
    
    data['Comment']=comments
    data['Industry']=industries
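
The same split can also be done inside pandas with a regular expression. This is only a sketch under assumptions: the sample text below is made up to mimic the shape of the scraped string (a comment in curly quotes followed by the industry in parentheses), and the pattern is one possible way to match that shape, not something taken from the page itself.

```python
import pandas as pd

# Hypothetical sample in the same shape as the scraped text: one item per
# line, a comment in curly quotes followed by the industry in parentheses.
LI_TAG = (
    '\u201cBusiness is steady.\u201d (Chemical Products)\n'
    '\u201cOrders are \u201cup\u201d this quarter.\u201d (Machinery)\n'
)

# One entry per non-empty line.
lines = [line.strip() for line in LI_TAG.splitlines() if line.strip()]

# Named groups become the column names. The greedy (.*) runs up to the last
# closing quote, so quotation marks inside a comment are kept.
pattern = '^\u201c(?P<Comment>.*)\u201d\\s*\\((?P<Industry>[^()]*)\\)$'
data = pd.Series(lines).str.extract(pattern)

print(data)
```

Lines that do not match the pattern come back as NaN in both columns, so they are easy to spot instead of raising an error in the middle of the loop.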
    

    Hope this works for you!
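
It may also be simpler to never build the long string in the first place and collect one string per list item instead. A minimal sketch, assuming the page has a strong element followed by a sibling ul as in the question; the HTML fragment here is invented for illustration and 'html.parser' is used only so the example does not need lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment shaped like the section the question scrapes.
html = """
<div>
  <strong>WHAT RESPONDENTS ARE SAYING …</strong>
  <ul>
    <li>“Business is steady.” (Chemical Products)</li>
    <li>“Orders are up this quarter.” (Machinery)</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
strong_el = soup.find('strong', string='WHAT RESPONDENTS ARE SAYING …')
ul_tag = strong_el.find_next_sibling('ul')

# find_all('li') returns only the <li> tags, so the whitespace nodes that
# ul_tag.children also yields are skipped; get_text(strip=True) trims each item.
items = [li.get_text(strip=True) for li in ul_tag.find_all('li')]

print(items)
```

Each entry of items is then a single comment with its industry, ready to be split into the two columns.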