
Rvest mailto with <a href="javascript:linkTo_UnCryptMailto(%27ocknvq%2Cjgkmg0qdgtnkpBwpk%5C%2Fvwgdkpigp0fg%27)


I would like to extract email addresses with rvest from this link. However, the mailto href is masked by a JavaScript call, so the address never appears in plain text.

How can I improve the following code?

    library(rvest)
    library(stringr)

    uni <- "https://uni-tuebingen.de/fakultaeten/philosophische-fakultaet/fachbereiche/asien-orient-wissenschaften/indologie/mitarbeiter/"
    r <- read_html(uni)
    a <- r %>%
      html_nodes("a") %>%
      html_attrs() %>%
      as.character() %>%
      str_subset("mailto:") %>%
      str_remove("mailto:")

Thanks in advance


Solution

    def decryptCharcode(n, start, end, offset):
        # Shift one character by `offset`, wrapping around inside [start, end].
        n = ord(n) + offset
        if offset > 0 and n > end:
            n = start + (n - end - 1)
        elif offset < 0 and n < start:
            n = end - (start - n - 1)
        return chr(n)


    def decryptString(enc, offset):
        # The last three characters are the trailing "%27" (URL-encoded quote), so skip them.
        dec = ""
        for n in enc[:-3]:
            if 0x2B <= ord(n) <= 0x3A:    # + , - . / 0-9 :
                dec += decryptCharcode(n, 0x2B, 0x3A, offset)
            elif 0x40 <= ord(n) <= 0x5A:  # @ A-Z
                dec += decryptCharcode(n, 0x40, 0x5A, offset)
            elif 0x61 <= ord(n) <= 0x7A:  # a-z
                dec += decryptCharcode(n, 0x61, 0x7A, offset)
            else:
                dec += n                  # anything else is copied unchanged
        return dec


    email = "%27ocknvq%2Cuvqemgt0ygtpgtBdnwgykp0ej%27"

    # Drop the prefix: URL-encoded quote plus the encrypted "mailto:".
    if "%27ocknvq%2C" in email:
        email = email.replace("%27ocknvq%2C", "")
    email = decryptString(email, -2)

    # "%5C%2F" (URL-encoded "\/", the encrypted "-") comes out of the loop as "%3A%0D".
    if "%3A%0D" in email:
        email = email.replace("%3A%0D", "-")

    print(email)

I converted the JS code to Python. Reference: https://gist.github.com/InsanityMeetsHH/c38f513f28d6f9b778912f110c565348
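
If you would rather pass in the whole href from the question, a variant along these lines might work. It is only a sketch: `decode_href`, `shift_char` and `RANGES` are names made up for this example, the offset of -2 and the three character ranges are taken from the code above, and it assumes the href always has the `linkTo_UnCryptMailto(...)` shape shown in the title. Instead of string replacements it URL-decodes the argument first with `urllib.parse.unquote`, then shifts every character.

```python
import re
from urllib.parse import unquote

# (first, last) code points of the three ranges used in the answer's code.
RANGES = [(0x2B, 0x3A), (0x40, 0x5A), (0x61, 0x7A)]


def shift_char(ch, offset):
    """Shift ch by offset inside its range, wrapping around; other characters pass through."""
    code = ord(ch)
    for start, end in RANGES:
        if start <= code <= end:
            size = end - start + 1
            return chr(start + (code - start + offset) % size)
    return ch


def decode_href(href, offset=-2):
    """Return the plain address from a linkTo_UnCryptMailto(...) href, or None if it does not match."""
    match = re.search(r"linkTo_UnCryptMailto\((.*)\)", href)
    if match is None:
        return None
    # Undo the URL encoding, then drop the surrounding quotes and the JS backslash escapes.
    arg = unquote(match.group(1)).strip("'\"").replace("\\", "")
    plain = "".join(shift_char(c, offset) for c in arg)
    # The decrypted text starts with "mailto:" (assumption based on the example hrefs).
    return plain[len("mailto:"):] if plain.startswith("mailto:") else plain


href = "javascript:linkTo_UnCryptMailto(%27ocknvq%2Cjgkmg0qdgtnkpBwpk%5C%2Fvwgdkpigp0fg%27)"
print(decode_href(href))
```

Because the argument is URL-decoded before shifting, the `%3A%0D` workaround is not needed here. Other sites may use a different offset, so treat -2 as a value to check rather than a constant.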