How to retrieve LinkedIn profiles via the API for URLs with accented characters?


I'm trying to get info from the LinkedIn API, but I run into issues when the URLs contain accented characters.

For non-accented URLs the call to the API works fine and I can retrieve data without problems, but when I try accented URLs I get an error.

I have tried escaping the URL, but it doesn't work:

uri_escape_utf8:

'https://api.linkedin.com/v1/people/url=' . uri_escape_utf8('xxxxx');

uri_escape:

'https://api.linkedin.com/v1/people/url=' . uri_escape('xxxxx');

no escaping:

'https://api.linkedin.com/v1/people/url=xxxxx';

double escape:

uri_escape_utf8('https://api.linkedin.com/v1/people/url=' . uri_escape_utf8('xxxxx'));

Solution

  • Update

    I'm pretty sure the problem is that you don't have use utf8 at the top of your program. With it in place, this code correctly encodes the i-diaeresis (ï) as %C3%AF and the e-acute (é) as %C3%A9

    use utf8;
    use strict;
    use warnings 'all';
    use feature 'say';
    
    use URI::Escape qw/ uri_escape_utf8 /;
    
    say uri_escape_utf8('http://linkedin.com/in/anaïs-thévoz-b070838');
    

    output

    http%3A%2F%2Flinkedin.com%2Fin%2Fana%C3%AFs-th%C3%A9voz-b070838
    

    Whereas without use utf8, Perl sees the UTF-8-encoded bytes of the string literal instead of characters, like this

    "http://linkedin.com/in/ana\xC3\xAFs-th\xC3\xA9voz-b070838"
    

    and uri_escape_utf8 double-encodes "\xC3\xAF" as %C3%83%C2%AF and "\xC3\xA9" as %C3%83%C2%A9 like this

    output

    http%3A%2F%2Flinkedin.com%2Fin%2Fana%C3%83%C2%AFs-th%C3%83%C2%A9voz-b070838
    

    so the LinkedIn server receives a mangled profile URL and presumably cannot match it
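The difference can be reproduced without depending on the encoding of the source file. This is a minimal sketch: the variable names are made up for illustration, and the byte string is built with Encode's encode to stand in for a literal that Perl read without use utf8.

```perl
use strict;
use warnings;

use Encode qw/ encode /;
use URI::Escape qw/ uri_escape_utf8 /;

# A character string: "\x{EF}" is i-diaeresis, "\x{E9}" is e-acute
my $chars = "ana\x{EF}s-th\x{E9}voz";

# The same text as UTF-8-encoded bytes, which is what Perl holds when the
# literal is read without "use utf8"
my $bytes = encode('UTF-8', $chars);

my $correct = uri_escape_utf8($chars);    # characters in, UTF-8 applied once
my $doubled = uri_escape_utf8($bytes);    # bytes in, UTF-8 applied a second time

print "$correct\n";    # ana%C3%AFs-th%C3%A9voz
print "$doubled\n";    # ana%C3%83%C2%AFs-th%C3%83%C2%A9voz
```

The second line of output shows the same %C3%83%C2%AF and %C3%83%C2%A9 sequences as the double-encoded URL above.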



    URLs use only eight-bit octets and there is no assumed encoding for Unicode characters

    RFC 3986 is the current standard for Uniform Resource Identifiers (URIs). Section 2, Characters, explains that the only characters allowed in a URL are the reserved delimiters !, #, $, &, ', (, ), *, +, ,, /, :, ;, =, ?, @, [, ] and the unreserved characters, which are used to build identifiers and match the regex pattern [0-9A-Za-z._~-]

    You can get around this restriction by using the percent sign % followed by two hex digits to represent any octet without its special meaning. Each escape covers only a single octet, though, so a multi-byte character needs one escape per octet, and nothing in the URL says which encoding those octets are in.
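To illustrate why the encoding matters, here is a rough sketch that percent-encodes the same character under two different encodings. The helper percent_encode_octets is hypothetical and written only for this example; it escapes every octet, which real code such as URI::Escape does not do for unreserved characters.

```perl
use strict;
use warnings;

use Encode qw/ encode /;

# Hypothetical helper: percent-encode every octet of a byte string
sub percent_encode_octets {
    my ($octets) = @_;
    return join '', map { sprintf('%%%02X', ord($_)) } split //, $octets;
}

my $char = "\x{EF}";    # i-diaeresis as a single character

# The same character becomes different octets under different encodings
my $as_utf8   = percent_encode_octets(encode('UTF-8',      $char));
my $as_latin1 = percent_encode_octets(encode('ISO-8859-1', $char));

print "UTF-8:      $as_utf8\n";      # %C3%AF
print "ISO-8859-1: $as_latin1\n";    # %EF
```

A server that receives %C3%AF or %EF has to know, or guess, which encoding the client used.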

    If you are using URI::Escape then uri_escape_utf8 will correctly encode any character string as UTF-8, using a combination of unreserved and percent-encoded characters, but the server must be expecting a UTF-8-encoded URL

    The most likely problems are

    • Your original string is already encoded and contains encoded bytes instead of characters, so uri_escape_utf8 is encoding an encoded string

    • The LinkedIn API doesn't expect UTF-8-encoded URLs
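For the first of these cases, one possible fix is sketched below. It assumes the profile URL arrives as undecoded UTF-8 bytes, for example from a file or database; the variable names are made up for illustration, and whether the LinkedIn API accepts the result is not something this sketch can confirm.

```perl
use strict;
use warnings;

use Encode qw/ decode /;
use URI::Escape qw/ uri_escape uri_escape_utf8 /;

# Assumed input: UTF-8 bytes that were never decoded into characters
my $raw = "http://linkedin.com/in/ana\xC3\xAFs-th\xC3\xA9voz-b070838";

# Option 1: decode to characters first, so uri_escape_utf8 encodes only once
my $profile = decode('UTF-8', $raw);
my $escaped = uri_escape_utf8($profile);

# Option 2: keep the bytes and use uri_escape, which escapes each octet as is
my $escaped_bytes = uri_escape($raw);

# Both options give the same percent-encoded string
my $api_url = 'https://api.linkedin.com/v1/people/url=' . $escaped;
print "$api_url\n";
```

Either way, the point is to know whether a string holds characters or bytes before escaping it, and to pick the matching function.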