I'm trying to get info from the LinkedIn API, but I run into issues when the URLs contain any kind of accented characters.
For non-accented URLs the call to the API works fine and I can retrieve data without problems, but when I try with accented URLs I get an error.
I have tried escaping the URL in the following ways, but none of them work:
uri_escape_utf8:
'https://api.linkedin.com/v1/people/url=' . uri_escape_utf8('xxxxx');
uri_escape:
'https://api.linkedin.com/v1/people/url=' . uri_escape('xxxxx');
no escaping:
'https://api.linkedin.com/v1/people/url=xxxxx';
double escape:
uri_escape_utf8('https://api.linkedin.com/v1/people/url=' . uri_escape_utf8('xxxxx'));
I'm pretty sure the problem is that you don't have use utf8 at the top of your program. This code correctly encodes the i-diaeresis (ï) as %C3%AF and the e-acute (é) as %C3%A9
use utf8;
use strict;
use warnings 'all';
use feature 'say';
use URI::Escape qw/ uri_escape_utf8 /;
say uri_escape_utf8('http://linkedin.com/in/anaïs-thévoz-b070838');
http%3A%2F%2Flinkedin.com%2Fin%2Fana%C3%AFs-th%C3%A9voz-b070838
Whereas without use utf8, Perl sees the UTF-8-encoded bytes instead of characters, like this
"http://linkedin.com/in/ana\xC3\xAFs-th\xC3\xA9voz-b070838"
and uri_escape_utf8 double-encodes "\xC3\xAF" as %C3%83%C2%AF and "\xC3\xA9" as %C3%83%C2%A9, like this
http%3A%2F%2Flinkedin.com%2Fin%2Fana%C3%83%C2%AFs-th%C3%83%C2%A9voz-b070838
so the LinkedIn server gets confused.
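The same double encoding can happen when the string comes from a file, a socket or the command line rather than from a literal in the source. As a minimal sketch, assuming the input arrives as raw UTF-8 bytes (the variable names here are only illustrative), decoding with the core Encode module before escaping gives the single-encoded form:

```perl
use strict;
use warnings;
use Encode qw/ decode /;
use URI::Escape qw/ uri_escape_utf8 /;

# The profile URL as raw UTF-8 bytes, as it would look if it had been
# read without any decoding layer (assumption for this example).
my $bytes = "http://linkedin.com/in/ana\xC3\xAFs-th\xC3\xA9voz-b070838";

# Escaping the bytes directly treats each byte as a separate character
# and encodes it a second time.
my $double = uri_escape_utf8($bytes);

# Decoding first turns the bytes into characters, so each accented
# letter is encoded exactly once.
my $chars  = decode('UTF-8', $bytes);
my $single = uri_escape_utf8($chars);

print "$double\n";   # ...ana%C3%83%C2%AFs-th%C3%83%C2%A9voz-b070838
print "$single\n";   # ...ana%C3%AFs-th%C3%A9voz-b070838
```

If the string is already a character string, decoding it again would be wrong, so this only applies when you know the input is still bytes.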
URLs use only eight-bit octets and there is no assumed encoding for Unicode characters
RFC 3986 is the current standard for Uniform Resource Identifiers (URIs), and Section 2 (Characters) explains that the only characters allowed in a URL are the special delimiters ! # $ & ' ( ) * + , / : ; = ? @ [ ] in addition to the unreserved characters that can be used to build identifiers, which match the regex pattern [0-9A-Za-z._~-]
You can get around this restriction by using the percent sign % followed by two hex digits to represent any octet without its special meaning, but this doesn't cover multi-byte characters, and there is no implied encoding if they are used within a URL.
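To make the octet-level view concrete, here is a rough sketch of percent-encoding done by hand. The helper name percent_encode_utf8 is made up for this example, and choosing UTF-8 as the encoding is an assumption that the receiving server has to share. It first turns the characters into octets, then replaces every octet outside the unreserved set with % and two hex digits:

```perl
use utf8;
use strict;
use warnings;
use Encode qw/ encode /;

# Hypothetical helper, not part of any module: percent-encode every
# octet that is not an RFC 3986 unreserved character.
sub percent_encode_utf8 {
    my ($string) = @_;

    # Characters -> octets. UTF-8 is our choice here; the URL syntax
    # itself does not imply any particular encoding.
    my $octets = encode('UTF-8', $string);

    # Each remaining octet becomes '%' followed by two hex digits.
    $octets =~ s/([^0-9A-Za-z._~-])/sprintf('%%%02X', ord($1))/ge;

    return $octets;
}

print percent_encode_utf8('anaïs'), "\n";   # ana%C3%AFs
```

This is essentially what uri_escape_utf8 does with its default settings, so in practice there is no reason to write it yourself.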
If you are using URI::Escape then uri_escape_utf8 will correctly encode any string in UTF-8 as a combination of unreserved and percent-encoded characters, but the server must be expecting a UTF-8-encoded URL.
The most likely problems are:
Your original string is already encoded and contains encoded bytes instead of characters, so uri_escape_utf8 is encoding an encoded string
The LinkedIn API doesn't expect UTF-8-encoded URLs
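One way to check for the first problem is to look at the code points of the non-ASCII characters in your string before you escape it. This is only a diagnostic sketch, and non_ascii_code_points is a hypothetical helper name: if you see a pair such as U+00C3 U+00AF where you expected the single U+00EF, the string still holds encoded bytes.

```perl
use strict;
use warnings;
use Encode qw/ decode /;

# Hypothetical diagnostic helper: return the code points of all
# non-ASCII characters in a string, formatted as U+XXXX.
sub non_ascii_code_points {
    my ($string) = @_;
    return map  { sprintf('U+%04X', ord($_)) }
           grep { ord($_) > 0x7F }
           split //, $string;
}

my $bytes = "ana\xC3\xAFs";            # UTF-8-encoded bytes
my $chars = decode('UTF-8', $bytes);   # decoded characters

print join(' ', non_ascii_code_points($bytes)), "\n";   # U+00C3 U+00AF
print join(' ', non_ascii_code_points($chars)), "\n";   # U+00EF
```

If the string turns out to hold characters already, the second problem becomes the more likely one.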