
Web Scraping coded price


While web scraping an article, I found that the price is visible in the browser's Elements panel but not in the page source (resources). Instead, the source contains the following encoded text:

<script>
var f3699334f586f4f2bb6edc10899026d63 = function(value) { 
    return base64UTF8Codec.decode(arguments[0])
};

replaceWith(
    document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'), 
    f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA=')
);
</script>

How can I decode the text into the price?



Solution

  • The text is Base64-encoded. If you can locate the right <script> tag with BeautifulSoup, you can pull the encoded string out of it with the re module and decode it with the base64 module:

    import re
    import base64
    from bs4 import BeautifulSoup
    
    txt = '''<script>
    var f3699334f586f4f2bb6edc10899026d63 = function(value){return base64UTF8Codec.decode(arguments[0])};
    replaceWith(document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'), f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA='));
    </script>'''
    
    soup = BeautifulSoup(txt, 'html.parser')
    
    # 1. locate the right <script> tag
    script = soup.script
    
    # 2. get coded text from the script tag
    coded_text = re.findall(r".*\('(.*?)'\)\);", script.text)[0]
    
    # 3. decode the text (the page decodes it with base64UTF8Codec, so treat the bytes as UTF-8)
    decoded_text = base64.b64decode(coded_text).decode('utf-8')  # '\n                <span class="pull-right"> 2.590,- </span>\n            '
    
    # 4. get the price from the decoded text
    soup2 = BeautifulSoup(decoded_text, 'html.parser')
    
    print(soup2.span.get_text(strip=True))
    

    Prints:

    2.590,-
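
On a real page there will usually be many <script> tags, so step 1 ("locate the right <script> tag") needs a filter. The following is a minimal sketch, not a definitive implementation. It assumes every encoded fragment is injected with the same call shape as in the question, `replaceWith(document.getElementById('<id>'), <func>('<base64>'))`, and that the script mentions `base64UTF8Codec`. The names `decode_fragments`, `parse_price` and `CALL_RE` are hypothetical helpers introduced here. The numeric conversion additionally assumes a European price format in which `.` is a thousands separator and `,-` means "no fractional part"; check this against the site before relying on it.

```python
import base64
import re
from decimal import Decimal

from bs4 import BeautifulSoup

# Assumed call shape, taken from the snippet in the question:
#   replaceWith(document.getElementById('<id>'), <func>('<base64>'));
CALL_RE = re.compile(
    r"getElementById\('([^']+)'\)\s*,\s*\w+\('([A-Za-z0-9+/=]+)'\)"
)


def decode_fragments(html):
    """Map each placeholder element id to its decoded HTML fragment."""
    soup = BeautifulSoup(html, "html.parser")
    fragments = {}
    for script in soup.find_all("script"):
        source = script.string or ""  # external scripts have no inline source
        # Skip scripts that do not use the page's decoder at all.
        if "base64UTF8Codec" not in source:
            continue
        for element_id, payload in CALL_RE.findall(source):
            fragments[element_id] = base64.b64decode(payload).decode("utf-8")
    return fragments


def parse_price(text):
    """Convert e.g. '2.590,-' to Decimal('2590') (assumed European format)."""
    cleaned = text.strip().replace(".", "")            # drop thousands separators
    cleaned = cleaned.replace(",-", "").replace(",", ".")  # ',-' = no cents
    return Decimal(cleaned)


# Usage with the script from the question:
sample = """<script>
var f3699334f586f4f2bb6edc10899026d63 = function(value) {
    return base64UTF8Codec.decode(arguments[0])
};
replaceWith(
    document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'),
    f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA=')
);
</script>"""

for element_id, fragment in decode_fragments(sample).items():
    price_text = BeautifulSoup(fragment, "html.parser").span.get_text(strip=True)
    print(element_id, price_text, parse_price(price_text))
```

Keying the result by element id makes it possible to match each decoded fragment back to the placeholder element it replaces in the page, which helps when several prices are encoded on one page.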