
Is there an easy way to get the content of a <pre> tag into a pandas dataframe?


I've been trying to load the content of a pre tag into a pandas dataframe, but I haven't been able to. This is what I have so far:

import requests,pandas
from bs4 import BeautifulSoup

#url

url='http://weather.uwyo.edu/cgi-bin/sounding?region=samer&TYPE=TEXT%3ALIST&YEAR=2019&MONTH=09&FROM=2712&TO=2712&STNM=80222'
peticion=requests.get(url)
soup=BeautifulSoup(peticion.content,"html.parser")

#get only the pre content I want

all=soup.select("pre")[0]

#write the content in a text file

with open('sound','w') as f:
    f.write(all.text)

#read it 
df = pandas.read_csv('sound')
df

I'm getting an unstructured dataframe, and since I have to do this for several URLs, I would rather pass the data to pandas directly after line 12 (where the pre content is selected), without needing to write a file.

This is the dataframe I get: [screenshot not preserved]


Solution

  • The content is fixed-width text, so you need to split it into lines on '\n' and then slice each line into columns of a fixed width (7 characters for this data). You could use the csv module to save on overhead, but you asked for a dataframe.

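    import pandas as pd
    import requests
    from bs4 import BeautifulSoup as bs

    url = 'http://weather.uwyo.edu/cgi-bin/sounding?region=samer&TYPE=TEXT%3ALIST&YEAR=2019&MONTH=09&FROM=2712&TO=2712&STNM=80222'
    r = requests.get(url)
    soup = bs(r.content, 'lxml')  # the 'lxml' parser requires the lxml package
    pre = soup.select_one('pre').text  # text of the first <pre> element
    results = []

    # Drop the first and last pieces of the split, then skip the dashed separator lines.
    for line in pre.split('\n')[1:-1]:
        if '--' not in line:
            # Every column is 7 characters wide, so slice the line in steps of 7.
            row = [line[i:i+7].strip() for i in range(0, len(line), 7)]
            results.append(row)

    # The header lines (column names and units) are kept as ordinary rows.
    df = pd.DataFrame(results)
    print(df)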
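
As an alternative to slicing by hand, pandas can parse fixed-width text itself and read it from memory, which avoids the temporary file. The following is a minimal sketch and not part of the original answer: the function name `fixed_width_text_to_dataframe` is hypothetical, and it assumes every column is 7 characters wide and that the first two non-separator lines hold the column names and the units.

```python
import io

import pandas as pd


def fixed_width_text_to_dataframe(text, width=7):
    """Parse fixed-width text (e.g. the text of a <pre> tag) into a DataFrame.

    Assumptions: every column is `width` characters wide, and the first two
    non-separator lines are the column names and the units.
    """
    # Keep only lines that contain something other than dashes; this drops
    # blank lines and the dashed separator rows.
    lines = [ln for ln in text.splitlines() if not set(ln.strip()) <= {"-"}]
    header, _units, *data = lines  # the units row is discarded here

    # Column names come from slicing the header row at the same fixed width.
    names = [header[i:i + width].strip() for i in range(0, len(header), width)]

    # read_fwf accepts any file-like object, so no file on disk is needed.
    return pd.read_fwf(
        io.StringIO("\n".join(data)),
        widths=[width] * len(names),
        names=names,
        header=None,
    )
```

With the variables from the answer above, `df = fixed_width_text_to_dataframe(soup.select_one('pre').text)` should give columns named after the header row, with blank cells read as NaN. For several URLs, the request and this call can be wrapped in a loop and the resulting frames collected in a list or dict.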