How to parse [email protected] data with JSOUP


Is there a way to use Jsoup to parse an email address that is protected by this piece of code:

<a href="/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="29484e4a404a4c50404469484e4a404a4c504044074a4644">[email&#160;protected]</a>

Parsing it the standard way, with elements.select(".email").text();, returns [email protected]. I searched for a solution but found mostly unrelated information.


Solution

  • The email address is "encrypted" by XORing every character in the address with a randomly generated key byte, which is stored as the first byte of the data-cfemail hex string. To decrypt the address, decode the hex string into a byte array and XOR every byte after the first with the first one.

    For example (in Python):

    In [1]: cfemail = '29484e4a404a4c50404469484e4a404a4c504044074a4644'
    
    In [2]: encoded_bytes = bytes.fromhex(cfemail)
    
    In [3]: encoded_bytes
    Out[3]: b')HNJ@JLP@DiHNJ@JLP@D\x07JFD'
    
    In [4]: bytes(byte ^ encoded_bytes[0] for byte in encoded_bytes[1:])
    Out[4]: b'agciceyim@agciceyim.com'
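
The session above can be wrapped in a small reusable function. This is only a sketch: the name decode_cfemail is made up for illustration, and it assumes the decoded bytes are valid UTF-8 text. With Jsoup, the input string would presumably come from the element's data-cfemail attribute (for example via attr("data-cfemail")), and the same XOR loop ports directly to Java.

```python
def decode_cfemail(cfemail: str) -> str:
    """Decode the hex string found in a data-cfemail attribute.

    Hypothetical helper: the first byte is treated as the XOR key,
    and every following byte is one obfuscated character.
    """
    data = bytes.fromhex(cfemail)
    key, payload = data[0], data[1:]
    # XOR each obfuscated byte with the key to recover the original byte.
    decoded = bytes(b ^ key for b in payload)
    # Assumption: the address is UTF-8 (plain ASCII in the usual case).
    return decoded.decode("utf-8")


if __name__ == "__main__":
    # Hex string taken from the question's data-cfemail attribute.
    print(decode_cfemail("29484e4a404a4c50404469484e4a404a4c504044074a4644"))
```

An input that is not valid hex makes bytes.fromhex raise ValueError, so callers parsing arbitrary pages may want to catch that.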