Search code examples
pythonpandasweb-scrapinghtml-parsing

How to parse a specific part of html table data using pandas


I have been learning how to scrape a web page using Pandas and I have hit a bit of a wall where I cant extract a specific piece of data that inside the itself.

Here is the html which is being parsed by Pandas:

<tr data-country="Bulgaria"> 
<td><i aria-hidden="true" class="                                    
    circle-country-flags-22 flags-22-bulgaria     display-inline-block"></i>
<a title="Bulgaria Economic Calendar" href="https://www.myfxbook.com/forex-economic- 
 calendar/bulgaria">Bulgaria</a></td>
<td>BNB</td>
<td> <a title="Bulgaria Interest Rates" href="https://www.myfxbook.com/forex-economic- 
calendar/bulgaria/interest-rate-decision">Bulgarian National Bank</a> </td> 
<td class="green"> 2.17% </td>
<td>1.82%</td>
<td> 35bp </td>
<td data-custom-date="2023-04-28 00:00:00.0">Apr 28, 2023</td>
<td data-custom-date="2023-05-29 10:00:00.0">1 day</td>
</tr>

And here is what my response array looks like:

{'Central Bank': 'Bulgarian National Bank',
  'Change': '35bp',
  'Country': 'Bulgaria',
  'Current Rate': '2.17%',
  'Last Meeting': 'Apr 28, 2023',
  'Next Meeting': '1 day',
  'Previous Rate': '1.82%',
  'Unnamed: 1': 'BNB'}

This is the line I am specifically looking at "1 day"

As I am trying to parse this "2023-05-29 10:00:00.0" into the response instead of "1 day"

Here is the code I have created for this so far:

import pandas as pd
import requests
import pprint
from datetime import datetime, timedelta


url = "https://www.myfxbook.com/forex-economic-calendar/interest-rates"

r = requests.get(url)
tables = pd.read_html(r.text) # this parses all the tables in webpages to a list
# Extract the first table from the list of parsed tables
parsed_table = tables[0]

# Convert DataFrame to list of dictionaries
list_of_dicts = parsed_table.to_dict(orient='records')

# Print the list of dictionaries

data = []

for row in list_of_dicts:
    data.append(row)


pp = pprint.PrettyPrinter(depth=4)
pp.pprint(data)

I have been scouring the interwebs but have not been able to find a solution so far as to how I do this so any help would be appreciated on this one.


Solution

  • Easy solution would be using a HTML parser (such as beautifulsoup) and replace the text of <td> tags. Then use pd.read_html to get the dataframe:

    import pprint
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    url = "https://www.myfxbook.com/forex-economic-calendar/interest-rates"
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    
    # select all tags with data-custom-date= attribute
    for tag in soup.select('[data-custom-date]'):
        # replace the text of these tags with value of this attribute
        tag.string.replace_with(tag['data-custom-date'])
    
    parsed_table = pd.read_html(str(soup))[0]
    data = parsed_table.to_dict(orient="records")
    
    pp = pprint.PrettyPrinter(depth=4)
    pp.pprint(data)
    

    Prints:

    [{'Central Bank': 'Bulgarian National Bank',
      'Change': '35bp',
      'Country': 'Bulgaria',
      'Current Rate': '2.17%',
      'Last Meeting': '2023-04-28 00:00:00.0',
      'Next Meeting': '2023-05-29 10:00:00.0',
      'Previous Rate': '1.82%',
      'Unnamed: 1': 'BNB'},
     {'Central Bank': 'Central Bank of Kenya',
      'Change': '75bp',
      'Country': 'Kenya',
      'Current Rate': '9.5%',
      'Last Meeting': '2023-03-29 00:00:00.0',
      'Next Meeting': '2023-05-29 13:30:00.0',
      'Previous Rate': '8.75%',
      'Unnamed: 1': 'CBK'},
     {'Central Bank': 'National Bank of the Kyrgyz Republic',
      'Change': '0bp',
      'Country': 'Kyrgyzstan',
      'Current Rate': '13.0%',
    
    ...and so on.