
Issue scraping dynamic table data from Human Microbiome Project (HMP) using Python, Beautiful Soup, and Selenium

I am trying to scrape the 'File UUID' column from the dynamic table on the HMP website using Python (Beautiful Soup and Selenium). I can pull every column from the table except the one I need, which does not show up. Below is the Python code I am running. Please let me know what the issue may be, or whether there is a better approach to getting this data.

from bs4 import BeautifulSoup
from selenium import webdriver
import time

import numpy as np
import pandas as pd

# establishing connection to hmp main website and parsing hmp data table information
url = ',%2216s_community%22%5D%20and%20sample.body_site%20in%20%5B%22feces%22%5D&filters=%7B%22op%22:%22and%22,%22content%22:%5B%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22file.matrix_type%22,%22value%22:%5B%22wgs_community%22,%2216s_community%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22sample.body_site%22,%22value%22:%5B%22feces%22%5D%7D%7D%5D%7D#:~:text=Samples%20(3%2C452)-,Files%20(5%2C181),-files'
browser = webdriver.Chrome()
browser.get(url)
time.sleep(10)  # crude wait so the JavaScript-rendered table has time to load
html = browser.page_source
hmp_parsed_page = BeautifulSoup(html, "lxml")
hmp_files_table = hmp_parsed_page.find('table', id='files-table')

# gathering hmp meta datatable column headers
hmp_metadata_fields = []
for th in hmp_files_table.find_all('th'):
    col_header = th.text
    hmp_metadata_fields.append(col_header)

# creating dataframe of hmp information scraped
hmp_metadata_df = pd.DataFrame(columns = hmp_metadata_fields)

# appending hmp row data to dataframe
for tr in hmp_files_table.find_all('tr')[1:]:
    row_data = tr.find_all('td')
    row = [data_point.text for data_point in row_data]
    hmp_metadata_df.loc[len(hmp_metadata_df.index)] = row

# dropping unneeded columns
hmp_metadata_df = hmp_metadata_df.drop(hmp_metadata_df.columns[[0,1]], axis = 1)

# adding hmp indicator to front of dataframe
hmp_metadata_df['Data Source'] = 'HMP'

# closing hmp website connection
browser.quit()

I have tried several different ways of scraping this table from HMP with no luck. I expect the output to be a table of all the columns and rows shown on the website, but the 'File UUID' column is missing. When I look up each element of the table with the browser's inspect tool, 'File UUID' is there under 'files-table':

<th title="File UUID" ng-repeat="h in tsc.headings" ng-class="{
              'sortable': h['sortable'],
              'sort-asc': tsc.tableParams.sorting()[h['sortable']]=='asc',
              'sort-desc': tsc.tableParams.sorting()[h['sortable']]=='desc'
            }" ng-click="tsc.sortByCol(h, $event)" ng-if="" class="header ng-scope sortable" role="button" tabindex="0" style=""><div class="ng-table-header " ng-class="{'sort-indicator': tsc.tableParams.defaultSettings.sortingIndicator == 'div'}"><span data-cell="tsc.getHeaderCell(h)" data-data="data" data-paging="paging" ng-class="{'sort-indicator': tsc.tableParams.defaultSettings.sortingIndicator == 'span'}" class="ng-isolate-scope sort-indicator">File UUID</span></div></th>
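One way to narrow this down is to compare the header titles with the number of cells per row in the HTML that Selenium actually returned. The sketch below is only a diagnostic under stated assumptions, not part of the original script: it uses the standard library's HTMLParser rather than Beautiful Soup, the names TableShape and table_shape are hypothetical, and it assumes every header carries a title attribute as in the snippet above.

```python
from html.parser import HTMLParser


class TableShape(HTMLParser):
    """Collect <th> titles and per-row <td> counts for one table id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0            # >0 while inside the target table
        self.headers = []         # title attribute of each <th>
        self.cells_per_row = []   # <td> count for each <tr>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            # enter the target table, or track a table nested inside it
            if self.depth or attrs.get("id") == self.table_id:
                self.depth += 1
        elif self.depth and tag == "th":
            self.headers.append(attrs.get("title", ""))
        elif self.depth and tag == "tr":
            self.cells_per_row.append(0)
        elif self.depth and tag == "td" and self.cells_per_row:
            self.cells_per_row[-1] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1


def table_shape(html, table_id="files-table"):
    """Return (header titles, <td> count of every row that has cells)."""
    parser = TableShape(table_id)
    parser.feed(html)
    parser.close()
    rows = [n for n in parser.cells_per_row if n]  # drop header-only rows
    return parser.headers, rows
```

Calling table_shape(browser.page_source) would show whether 'File UUID' is among the headers and whether each row has as many cells as there are headers; a mismatch would suggest the page source was captured before the table finished rendering.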


  • You can query their Ajax API directly to obtain the data (I believe the UUID is in the 'id' column):

    import requests
    import pandas as pd

    url = ""
    params = {
        "fields": "file_format,file_type,file_annotation_pipeline,file_matrix_type",
        "filters": '{"op":"and","content":[{"op":"in","content":{"field":"file.matrix_type","value":["wgs_community","16s_community"]}},{"op":"in","content":{"field":"sample.body_site","value":["feces"]}}]}',
        "from": 0,
        "save": "",
        "size": "20",
        "sort": "file_id:asc",
    }

    all_dfs = []
    for offset in range(0, 40, 20): # <--- increase the range for next pages
        params["from"] = offset  # index of the first record on this page
        data = requests.get(url, params=params).json()
        # each hit carries the file metadata under the 'file' key
        all_dfs.append(pd.DataFrame([h["file"] for h in data["data"]["hits"]]))

    # combine the pages into one table and show the last rows
    df = pd.concat(all_dfs).reset_index(drop=True)
    print(df.tail())


        format_doc  study        ver  organism_type  format                         data_modality    node_type         size    subtype        fasp     data_type  matrix_type    abundance_type  https  id                                md5                               file_name  access  comment
    35              prediabetes  NaN  bacterial      Biological Observation Matrix  marker sequence  abundance_matrix  196000  16s_community  fasp://  abundance  16s_community  community              76612bd9a41885add4f6b0b7683a65da  70600351056001048c1d42d7268cc6b7             open    Qiime output upload from DCC for HMP2_J45372_1_ST_T0_B0_0120_ZY39SN0-02_APB4D.clean.dehost.fastq.gz
    36              prediabetes  NaN  bacterial      Biological Observation Matrix  marker sequence  abundance_matrix  196000  16s_community  fasp://  abundance  16s_community  community              76612bd9a41885add4f6b0b76836df9b  39643700bd4bcf040064c12f1d2b644c             open    Qiime output upload from DCC for HMP2_J45281_1_ST_T0_B0_0120_ZRB0F6P-6021_APATM.clean.dehost.fastq.gz
    37              prediabetes  NaN  bacterial      Biological Observation Matrix  marker sequence  abundance_matrix  81000   16s_community  fasp://  abundance  16s_community  community              6cca313bce90a4392c3d5cf23fdb7ca8  7a33c9809cb98fac4e89aa2d3c151597             open    Qiime output upload from DCC for HMP2_J04182_1_ST_T0_B0_0122_ZN0JE53-04_AAH7B.clean.dehost.fastq.gz
    38              prediabetes  NaN  bacterial      Biological Observation Matrix  marker sequence  abundance_matrix  204000  16s_community  fasp://  abundance  16s_community  community              76612bd9a41885add4f6b0b7681567ac  7a33c9809cb98fac4e89aa2d3c151597             open    Qiime output upload from DCC for HMP2_J04182_1_ST_T0_B0_0122_ZN0JE53-04_AAH7B.clean.dehost.fastq.gz
    39              prediabetes  NaN  bacterial      Biological Observation Matrix  marker sequence  abundance_matrix  120000  16s_community  fasp://  abundance  16s_community  community              6cca313bce90a4392c3d5cf23fdafbcc  9757b64815cbfee3ba188e80b69a023e             open    Qiime output upload from DCC for HMP2_J00840_1_ST_T0_B0_0120_ZLZNCLZ-01_AA31J.clean.dehost.fastq.gz
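If only the UUIDs are needed, the paging loop above could be wrapped in a small helper that keeps requesting pages until the results run out. This is a minimal sketch under assumptions: collect_file_ids and fetch_page are hypothetical names, the response is assumed to keep the data → hits → file shape used above, the UUID is assumed to live under 'id', and the last page is assumed to be short or empty.

```python
def collect_file_ids(fetch_page, page_size=20, max_pages=None):
    """Page through the results and return the 'id' of every file hit.

    fetch_page(offset, size) is a caller-supplied function (hypothetical)
    that returns the decoded JSON for one page of results.
    """
    ids = []
    offset = 0
    pages = 0
    while max_pages is None or pages < max_pages:
        data = fetch_page(offset, page_size)
        hits = data["data"]["hits"]
        if not hits:
            break  # an empty page means there is nothing left
        ids.extend(hit["file"]["id"] for hit in hits)
        if len(hits) < page_size:
            break  # a short page is taken to be the last one
        offset += page_size
        pages += 1
    return ids


# Possible wiring to the request above (url and params as defined there):
#
# def fetch_page(offset, size):
#     page_params = {**params, "from": offset, "size": size}
#     return requests.get(url, params=page_params).json()
#
# file_uuids = collect_file_ids(fetch_page)
```

Passing the page fetcher in as a function keeps the paging logic separate from the HTTP call, so it can be checked against canned responses before it is pointed at the real endpoint.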