Generate a permalink for a Hindi blog post in PHP


I have a form that takes the following inputs from the user:

  • Blog title
  • Blog Description
  • Permalink to access blog

I convert the blog title to lower case, replace white space with a dash (-), and store the result as the permalink used to access the blog.
Below is the code that handles this:

setlocale(LC_ALL, 'en_US.UTF8');

function toAscii($str, $replace = array(), $delimiter = '-') {
    // Optionally turn extra characters into spaces before cleaning
    if (!empty($replace)) {
        $str = str_replace((array)$replace, ' ', $str);
    }
    // Transliterate to ASCII, drop everything outside the allowed set,
    // then collapse runs of separators into the delimiter
    $clean = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
    $clean = preg_replace("/[^a-zA-Z0-9\/_|+ -]/", '', $clean);
    $clean = strtolower(trim($clean, '-'));
    $clean = preg_replace("/[\/_|+ -]+/", $delimiter, $clean);
    return $clean;
}

$prmlkn = toAscii($blog_headline);

This code works fine as long as the blog headline is in English. But if the user types the headline in Hindi, I get only - as the permalink, which means the Hindi POST values are not being recognized.


Solution

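  • This happens because Hindi is written in Devanagari script, whose characters lie outside ASCII. You are converting to ASCII, which only provides basic Latin characters, and iconv has no ASCII transliteration for Devanagari, thus:

    $str = "नमस्ते";
    $clean = iconv('UTF-8', 'ASCII//TRANSLIT', $str); // no Devanagari survives: depending on the iconv implementation, $clean is an empty string, only "?" placeholders, or false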
    

    According to RFC 3986:

    2. Characters

    ...

    The ABNF notation defines its terminal values to be non-negative
    integers (codepoints) based on the US-ASCII coded character set
    [ASCII]. Because a URI is a sequence of characters, we must invert
    that relation in order to understand the URI syntax. Therefore, the
    integer values used by the ABNF must be mapped back to their
    corresponding characters via US-ASCII in order to complete the
    syntax rules.

    A URI is composed from a limited set of characters consisting of
    digits, letters, and a few graphic symbols. A reserved subset of
    those characters may be used to delimit syntax components within a
    URI while the remaining characters, including both the unreserved
    set and those reserved characters not acting as delimiters, define
    each component's identifying data.

    You might be better off using urlencode(), but note that this can make the permalink long and ugly:

    $str = "नमस्ते hello";
    $clean = urlencode($str);
    printf("%s", $clean);

    This results in a valid but ugly permalink:

    %E0%A4%A8%E0%A4%AE%E0%A4%B8%E0%A5%8D%E0%A4%A4%E0%A5%87+hello
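
If a readable Hindi slug is preferred, one possible alternative is to skip the ASCII conversion, keep Unicode letters in the stored slug, and percent-encode only when the URL is written out. The following is a minimal sketch under that assumption and is not part of the original answer: the function name unicodeSlug is hypothetical, and it relies on PCRE's Unicode property classes (\p{L}, \p{M}, \p{N}) with the /u modifier.

```php
<?php
// Hypothetical helper (not from the original question or answer):
// builds a slug that keeps Unicode letters, combining marks, and digits.
function unicodeSlug(string $str, string $delimiter = '-'): string
{
    // Lowercase ASCII letters only. Devanagari has no letter case, and a
    // byte-wise strtr() is safe here because ASCII bytes never occur
    // inside a multi-byte UTF-8 sequence.
    $slug = strtr($str, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');

    // Replace every run of characters that are not letters (\p{L}),
    // combining marks (\p{M}, needed for Hindi vowel signs and the virama),
    // or digits (\p{N}) with a single delimiter.
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', $delimiter, $slug);

    // preg_replace() returns null when the input is not valid UTF-8.
    if ($slug === null) {
        return '';
    }

    // Remove delimiters left over at either end.
    return trim($slug, $delimiter);
}

$slug = unicodeSlug("नमस्ते hello");
echo $slug, "\n";               // नमस्ते-hello
echo rawurlencode($slug), "\n"; // %E0%A4%A8%E0%A4%AE%E0%A4%B8%E0%A5%8D%E0%A4%A4%E0%A5%87-hello
```

rawurlencode() is used here rather than urlencode() because the slug is a path segment; urlencode() encodes spaces as +, which is the query-string convention. The encoded URL is just as long as before, but the slug stored alongside the post stays readable.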