Python: How to use string.translate() to replace quotation marks? (for "slug" creation)

I want to remove all strange characters from a string to make it "URL safe". Therefore, I have a function that goes like this:

def urlize(url, safe=u''):
   intab =  u"àáâãäåòóôõöøèéêëçìíîïùúûüÿñ" + safe
   outtab = u"aaaaaaooooooeeeeciiiiuuuuyn" + safe
   trantab = dict((ord(a), b) for a, b in zip(intab, outtab))
   return url.lower().translate(trantab).strip()

This works just great, but now I want to reuse that function to allow special characters, for example the quotation mark.

urlize(u'This is sóme randóm "text" that í wánt to process',u'"')

...and that throws the following error:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: expected a character buffer object

I have also tried this, but it did not work:

intab =  u"àáâãäåòóôõöøèéêëçìíîïùúûüÿñ%s" , safe

--EDIT-- The full function looks like this

def urlize(url, safe=u''):

    intab =  u"àáâãäåòóôõöøèéêëçìíîïùúûüÿñ" + safe
    outtab = u"aaaaaaooooooeeeeciiiiuuuuyn" + safe
    trantab = dict((ord(a), b) for a, b in zip(intab, outtab))
    translated_url = url.lower().translate(trantab).strip()

    pos = 0
    stop = len(translated_url)
    new_url= ''
    last_division_char = False

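    while pos < stop:
        if not translated_url[pos].isalnum() and translated_url[pos] not in safe:
            # Emit one '-' per run of disallowed characters, never at the very end.
            if (not last_division_char) and (pos != stop - 1):
                new_url += '-'
                last_division_char = True
        else:
            new_url += translated_url[pos]
            last_division_char = False
        pos += 1

    return new_url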

--EDIT-- Goal

What I want is to normalize text so that I can put it in the URL myself and use it like an ID. For example, to show the products of a category, I'd rather use "ninos-y-bebes" than "niños-y-bebés" (Spanish for "kids and babies"). I really don't want áéíóúñ (the special characters in Spanish) in my URL, but I don't want to simply drop them either. That's why I would like to replace every such character with its look-alike (not 100% of them, I don't care) and then delete any non-alphanumeric characters that are left.
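The two steps described above (replace look-alike characters, then get rid of whatever is not alphanumeric) can be sketched with the standard library alone. This is only a minimal sketch, assuming Python 3; `slugify` is a hypothetical name, and joining the words with a hyphen is an assumption taken from the "ninos-y-bebes" example.

```python
import re
import unicodedata


def slugify(text):
    """Hypothetical helper: turn arbitrary text into an ASCII slug."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Encoding to ASCII with "ignore" drops the combining marks
    # (and any character that has no ASCII decomposition at all).
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text).strip("-")


print(slugify(u"Niños y Bebés"))  # ninos-y-bebes
```

Characters without a decomposition, such as ø, are dropped by this approach rather than transliterated, so a hand-written table or a transliteration library is still needed if those matter.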


  • The unidecode module is a safer option (it also handles other special symbols, such as the degree sign):

    >>> from unidecode import unidecode
    >>> s = u'This is sóme randóm "text" that í wánt to process'
    >>> unidecode(s)
    'This is some random "text" that i want to process'
    >>> import urllib
    >>> urllib.urlencode(dict(x=unidecode(s)))[2:]

    i think i'm already doing that -> u"aaaaaaooooooeeeeciiiiuuuuyn" – Marco Bruggmann

    Fair enough, if you are willing to keep track of every Unicode character out there in your translation table (accented characters are not the only issue; there are a whole lot of symbols to rain on your parade).

    Worse, many Unicode symbols are visually identical to their ASCII counterparts, which leads to hard-to-diagnose errors.
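To illustrate the look-alike problem, here is a small sketch (assuming Python 3) comparing a Latin "a" with the Cyrillic letter that most fonts render the same way. The variable names are made up for the example.

```python
import unicodedata

latin_a = u"a"          # U+0061
cyrillic_a = u"\u0430"  # U+0430, usually drawn exactly like a Latin "a"

# The two strings look alike on screen but are different code points.
print(latin_a == cyrillic_a)         # False
print(unicodedata.name(latin_a))     # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic_a))  # CYRILLIC SMALL LETTER A
```

A hand-written translation table that only lists accented Latin letters would let the Cyrillic character pass through untouched.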

    What about something like:

    >>> safe_chars = 'abcdefghijklmnopqrstuvwxyz01234567890-_'
    >>> filter(lambda x: x in safe_chars, "i think i'm already doing that")
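On Python 2, `filter` applied to a str returns a str; on Python 3 it returns an iterator, so the same whitelist idea needs an explicit join. A minimal sketch, assuming Python 3 (`keep_safe` is a hypothetical helper name):

```python
safe_chars = "abcdefghijklmnopqrstuvwxyz0123456789-_"


def keep_safe(text):
    """Hypothetical helper: drop every character that is not whitelisted."""
    return "".join(ch for ch in text if ch in safe_chars)


print(keep_safe("i think i'm already doing that"))  # ithinkimalreadydoingthat
```

Note that this deletes spaces as well, so words run together unless the separators are replaced before filtering.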

    @Daenyth I tried it, but I only get errors: from urllib import urlencode => urlencode('';) => TypeError: not a valid non-string sequence or mapping object – Marco Bruggmann

    The urlencode function is intended to produce query-string formatted output (a=1&b=2&c=3). It expects key/value pairs:

    >>> urllib.urlencode(dict(url=''))
    >>> help(urllib.urlencode)
    Help on function urlencode in module urllib:
    urlencode(query, doseq=0)
        Encode a sequence of two-element tuples or dictionary into a URL query string.
        If any values in the query arg are sequences and doseq is true, each
        sequence element is converted to a separate parameter.
        If the query arg is a sequence of two-element tuples, the order of the
        parameters in the output will match the order of parameters in the
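For comparison, here is a short sketch of the same distinction on Python 3, where these functions live in `urllib.parse` (on Python 2 they are `urllib.urlencode` and `urllib.quote`). `urlencode` wants key/value pairs, while `quote` is the function that percent-encodes a single string:

```python
from urllib.parse import quote, urlencode

# urlencode builds a query string from key/value pairs.
print(urlencode({"url": ""}))       # url=
print(urlencode({"a": 1, "b": 2}))  # a=1&b=2

# quote percent-encodes one string, e.g. for use as a path segment.
print(quote('some "text"'))         # some%20%22text%22
```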

    Ok, Marco, what you want is a routine to create so-called slugs, isn't it?

    You can do it in one line:

    >>> s = u'This is sóme randóm "text" that í wánt to process'
    >>> allowed_chars = 'abcdefghijklmnopqrstuvwxyz0123456789'
    >>> ''.join([ x if x in allowed_chars else '-' for x in unidecode(s.lower()) ])
    >>> s = u"Niños y Bebés"
    >>> ''.join([ x if x in allowed_chars else '-' for x in unidecode(s.lower()) ])
    >>> s = u"1ª Categoria, ½ docena"
    >>> ''.join([ x if x in allowed_chars else '-' for x in unidecode(s.lower()) ])
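Because the one-liner emits one hyphen per disallowed character, a quotation mark next to a space produces a double hyphen. A possible follow-up step, sketched here for Python 3 on text that is already ASCII (so unidecode is left out), collapses those runs. `dashify` and `collapse_hyphens` are hypothetical names.

```python
import re

allowed_chars = "abcdefghijklmnopqrstuvwxyz0123456789"


def dashify(ascii_text):
    """Same character-by-character replacement as the one-liner."""
    return "".join(ch if ch in allowed_chars else "-" for ch in ascii_text.lower())


def collapse_hyphens(slug):
    """Squeeze runs of hyphens into one and trim them from both ends."""
    return re.sub(r"-+", "-", slug).strip("-")


raw = dashify('some "text" here')
print(raw)                    # some--text--here
print(collapse_hyphens(raw))  # some-text-here
```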