Parsing bitcoin address in html


I'm trying to use the Beautiful Soup library to parse a web page and find Bitcoin addresses.

I've managed to pull the elements containing a generated address out of the whole HTML document:

<div class="roundpic qrcode" data-height="80" data-text="bitcoin:1JL7kugm1vDLqyzrVPAPdcbjH3PTxcPcud?amount=0.0573" data-width="80" style="margin: auto"></div>, <div class="roundpic qrcode" data-height="160" data-text="bitcoin:1JL7kugm1vDLqyzrVPAPdcbjH3PTxcPcud?amount=0.0573" data-width="160" style="padding: 10px"></div>

What would be the best way to isolate the address? I know its length can be anywhere from 27 to 34 characters, but it always appears between 'bitcoin:' and '?'. Is there a regex I could use?

Thanks


Solution

  • You don't really need a regex to extract the address. Once you have the matching elements, basic string operations work just fine:

    import re
    from bs4 import BeautifulSoup
    
    html = '''
    <div class="roundpic qrcode" data-height="80" data-text="bitcoin:1JL7kugm1vDLqyzrVPAPdcbjH3PTxcPcud?amount=0.0573" data-width="80" style="margin: auto"></div>
    <div class="roundpic qrcode" data-height="160" data-text="bitcoin:1JL7kugm1vDLqyzrVPAPdcbjH3PTxcPcud?amount=0.0573" data-width="160" style="padding: 10px"></div>
    '''
    
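    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(html, 'html.parser')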
    
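    # Select every <div> whose data-text attribute starts with "bitcoin:".
    for div in soup.find_all('div', {'data-text': re.compile(r'^bitcoin:')}):
        # Drop the URI scheme, then split the address from the amount.
        address, amount = div.get('data-text').replace('bitcoin:', '').split('?amount=')
        print(address, amount)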
    

    soup.find_all('div', {'data-text': re.compile(r'^bitcoin:')}) finds all <div> elements whose data-text attribute value starts with bitcoin:. You could also have used a function. Beautiful Soup passes None to it for elements that lack the attribute, so guard against that:

    soup.find_all('div', {'data-text': lambda value: value and value.startswith('bitcoin:')})
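If you do want a regex, as the question asks, here is a minimal sketch using only the standard library. It assumes the address is written in the Base58 alphabet (no 0, O, I, or l), takes the 27 to 34 character range from the question rather than from any specification, and relies on the address always being followed by '?'. The names ADDRESS_RE and extract_addresses are made up for this example.

```python
import re

# Assumption: Base58 characters only (digits 1-9, letters without O, I, l).
# Assumption: the 27-34 length range comes from the question, not from a spec.
ADDRESS_RE = re.compile(r'bitcoin:([1-9A-HJ-NP-Za-km-z]{27,34})\?')


def extract_addresses(text):
    """Return the unique addresses found in text, in order of first appearance."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(ADDRESS_RE.findall(text)))


html = '''
<div class="roundpic qrcode" data-height="80" data-text="bitcoin:1JL7kugm1vDLqyzrVPAPdcbjH3PTxcPcud?amount=0.0573" data-width="80" style="margin: auto"></div>
<div class="roundpic qrcode" data-height="160" data-text="bitcoin:1JL7kugm1vDLqyzrVPAPdcbjH3PTxcPcud?amount=0.0573" data-width="160" style="padding: 10px"></div>
'''

print(extract_addresses(html))
# → ['1JL7kugm1vDLqyzrVPAPdcbjH3PTxcPcud']
```

The same pattern can be applied to each data-text value returned by Beautiful Soup instead of the raw HTML. A URI with no '?' after the address would not match, so loosen the trailing \? if the amount is optional on your pages.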