Tags: python, regex, nltk, tokenize

What regex matches website domains for tokenizing, while keeping punctuation apart from words?


(Screenshot of the default tokenizer's output omitted.)

What I want is to keep domain names as single tokens. For example, "https://www.twitter.com" should remain a single token.

My code:

import nltk
from nltk.tokenize.regexp import RegexpTokenizer

line="My website: http://www.cartoon.com is not accessible."
pattern = r'^(((([A-Za-z0-9]+){1,63}\.)|(([A-Za-z0-9]+(\-)+[A-Za-z0-9]+){1,63}\.))+){1,255}$'
tokeniser=RegexpTokenizer(pattern)

print (tokeniser.tokenize(line))

Output:

[]

What am I doing wrong? Any better regex for domain names?

Edit: Special characters must remain as separate tokens; in the example above, tokenization must separate 'website' and ':'.
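For reference, the empty result can be reproduced with the `re` module alone, which hints at the problem: `RegexpTokenizer` collects all matches of the pattern *inside* the line, but the `^` and `$` anchors force the pattern to match the entire string, which a full sentence never does. A minimal sketch:

```python
import re

# The anchors ^ and $ require the whole string to look like a bare domain,
# so nothing inside a full sentence can ever match.
pattern = r'^(((([A-Za-z0-9]+){1,63}\.)|(([A-Za-z0-9]+(\-)+[A-Za-z0-9]+){1,63}\.))+){1,255}$'
line = "My website: http://www.cartoon.com is not accessible."

print(re.findall(pattern, line))
```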


Solution

  • You may use

    tokeniser=RegexpTokenizer(r'\b(?:http|ftp)s?://\S*\w|\w+|[^\w\s]+')
    


    Details:

    • \b - leading word boundary (the match must be preceded by a non-word character or the start of the string)
    • (?:http|ftp)s?:// - a protocol part: http/https or ftp/ftps, followed by ://
    • \S* - zero or more non-whitespace characters
    • \w - a word character (letter, digit, or underscore), so the URL ends in a word character
    • | - or
    • \w+ - one or more word characters (an ordinary word)
    • | - or
    • [^\w\s]+ - one or more characters that are neither word characters nor whitespace (runs of punctuation)
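Since `RegexpTokenizer` with the default `gaps=False` simply returns all matches of the pattern, its behavior on the question's sentence can be checked with plain `re.findall` (a minimal sketch; swap in `RegexpTokenizer(pattern).tokenize(line)` for the same result):

```python
import re

# RegexpTokenizer(pattern) with gaps=False returns the pattern's matches,
# so re.findall reproduces the tokenization directly.
pattern = r'\b(?:http|ftp)s?://\S*\w|\w+|[^\w\s]+'
line = "My website: http://www.cartoon.com is not accessible."

tokens = re.findall(pattern, line)
print(tokens)
# ['My', 'website', ':', 'http://www.cartoon.com', 'is', 'not', 'accessible', '.']
```

The URL alternative is listed first so it wins over the plain `\w+` branch, and `[^\w\s]+` picks up the colon and the final period as their own tokens, which matches the edit's requirement.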