
How to extract text from this specific page? Unable to do so with bs4 + Python


I have the following page:

http://greyhoundbet.racingpost.com/#card/race_id=1632746&r_date=2018-08-17&tab=form

It contains several blocks of information organized in "tables". I need to extract that information (rows and columns) so I can manipulate it later.

Since I'm a newbie, I tried to do it with bs4 in Python, but I wasn't successful. What would you recommend?

1) Should I use a programming language that lets me read the text from the page and then manipulate it? (Which one should I use? What should I look for?)

2) Can I copy the text manually (Ctrl+C) and pass it to Python somehow?


What is the easiest way to get the info from the page so I can work with the data later?

Thank you all, and I'm sorry if this is a dumb question. I've been struggling with this for the past week.

Regards, P.

EDIT: I was thinking of using an object-oriented approach to separate every greyhound and study each number. Maybe it's better to do it in C#?


Solution

    1. I would suggest either Selenium with its Python bindings (https://selenium-python.readthedocs.io/) or CasperJS (http://casperjs.org/), which is built on PhantomJS. CasperJS scripts are written in JavaScript.
    2. Create a text file and paste the copied text into it. Then read the file with Python:

      with open('page_text.txt') as f:
          lines = f.readlines()
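
Once the pasted text is in a file, the lines still have to be split into columns. The following is only a minimal sketch of that step. It assumes the copied cells end up separated by a tab or by two or more spaces, which may not match what the browser actually puts on the clipboard for this page. The names `parse_rows` and `read_rows` are made up for illustration.

```python
import re


def parse_rows(text):
    """Split pasted text into rows, each row a list of cell strings."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between blocks of data
        # Assumption: cells are separated by one tab or by 2+ whitespace chars.
        rows.append(re.split(r"\t|\s{2,}", line))
    return rows


def read_rows(path):
    """Read a text file (e.g. page_text.txt) and return its rows."""
    with open(path, encoding="utf-8") as f:
        return parse_rows(f.read())
```

Calling `read_rows('page_text.txt')` would then give a list of rows that can be indexed like `rows[0][1]`. If the pasted text uses a different separator, only the regular expression needs to change.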

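    You cannot scrape this page with bs4 alone. The content is loaded dynamically by JavaScript, so the HTML that bs4 receives does not contain the tables yet. You need a tool that can load dynamic web pages in a (headless) browser, such as Selenium.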
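
As a rough illustration of the Selenium route, the sketch below loads the page in headless Chrome, waits for the data to appear, and returns the visible text. It assumes Selenium 4 and a local Chrome installation. The function names and the `ready_css` parameter are hypothetical, and the CSS selector in the example call is a placeholder: the real one has to be found by inspecting the page in the browser's developer tools.

```python
def fetch_visible_text(url, ready_css, timeout=15):
    """Load url in headless Chrome and return the rendered page's visible text.

    ready_css is a CSS selector for an element that only exists once the
    data has loaded (page-specific, an assumption you must fill in).
    """
    # Imported inside the function so that text_to_lines below stays usable
    # even on a machine without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # run Chrome without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the JavaScript-rendered element is present, or time out.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ready_css))
        )
        return driver.find_element(By.TAG_NAME, "body").text
    finally:
        driver.quit()  # always close the browser, even after an error


def text_to_lines(text):
    """Return the non-empty lines of text with surrounding whitespace removed."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Example call (not run here; the selector is a hypothetical placeholder):
# url = "http://greyhoundbet.racingpost.com/#card/race_id=1632746&r_date=2018-08-17&tab=form"
# lines = text_to_lines(fetch_visible_text(url, "div.placeholder-form-row"))
```

If you prefer to keep using bs4, `driver.page_source` can be returned instead of the body text and passed to `BeautifulSoup`, since at that point the HTML already contains the rendered content.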