python html beautifulsoup html-parsing

How to use a two-part key to scrape HTML?


For example, I have the following HTML:

   ...   
        <tr class="main" data-year="Month">...</tr>
        <tr class="main" data-year="Month">...</tr>
        <tr class="main" data-year="Month">...</tr>
        ...
        <tr class="main" data-year="Month">
          <td class="month" title="" data-x-key="name">June</td>
          <td class="month" title="" data-x-key="volume">100</td>
          <td class="month" title="" data-x-key="date">06/27/2022</td>
        </tr>
        ...
        <tr class="main" data-year="Month">...</tr>
    ...

I have the parsing code below, but I want to change it. How can I select each cell by its data-x-key attribute instead of repeating find_next('td', class_='month') for every column?

    ...
        soup = BeautifulSoup(html, 'html.parser')
        item = soup.find_all('tr', class_='main')
        data = []
        for i in item:
            data.append({
                'name': i.find('td', class_='month').get_text(),
                'volume': i.find('td', class_='month').find_next('td', class_='month').get_text(),
                'date': i.find('td', class_='month').find_next('td', class_='month')
                         .find_next('td', class_='month').get_text(),
            })
        print(data)
...

Solution

  • Try CSS attribute selectors with select_one, which match each cell by its data-x-key value instead of by its position:

    from bs4 import BeautifulSoup

    html = '''
    <tr class="main" data-year="Month">
        <td class="month" title="" data-x-key="name">June</td>
        <td class="month" title="" data-x-key="volume">100</td>
        <td class="month" title="" data-x-key="date">06/27/2022</td>
    </tr>
    '''

    soup = BeautifulSoup(html, 'html.parser')
    rows = soup.find_all('tr', class_='main')

    data = []
    for row in rows:
        # Select each cell by its data-x-key attribute rather than by position.
        data.append({
            'name': row.select_one('td[data-x-key="name"]').get_text(),
            'volume': row.select_one('td[data-x-key="volume"]').get_text(),
            'date': row.select_one('td[data-x-key="date"]').get_text(),
        })
    print(data)
    

    Output:

    [{'name': 'June', 'volume': '100', 'date': '06/27/2022'}]
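
    If the set of keys is not known in advance, or some rows lack a cell, one possible variation is to build each record from whatever data-x-key cells the row contains. This is only a sketch: the surrounding <table> and the second row (July, with no date cell) are made-up sample data, not part of the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second row is invented and has no "date" cell.
html = '''
<table>
  <tr class="main" data-year="Month">
    <td class="month" title="" data-x-key="name">June</td>
    <td class="month" title="" data-x-key="volume">100</td>
    <td class="month" title="" data-x-key="date">06/27/2022</td>
  </tr>
  <tr class="main" data-year="Month">
    <td class="month" title="" data-x-key="name">July</td>
    <td class="month" title="" data-x-key="volume">250</td>
  </tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

data = []
for row in soup.select('tr.main'):
    # Map each cell's data-x-key value to its stripped text.
    # td[data-x-key] matches any <td> that carries the attribute at all.
    record = {
        td['data-x-key']: td.get_text(strip=True)
        for td in row.select('td[data-x-key]')
    }
    data.append(record)

print(data)
```

    With this approach a missing cell simply leaves its key out of the record, whereas select_one(...) returns None for a missing cell and the chained .get_text() call would raise an AttributeError.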