Tags: python, regex, nltk, tokenize

What regex matches website domains for tokenizing, while keeping punctuation apart from words?


(Screenshot of the default tokenizer's output omitted.)

What I want is to keep domain names as single tokens. For example, "https://www.twitter.com" should remain a single token.

My code:

import nltk
from nltk.tokenize.regexp import RegexpTokenizer

line="My website: http://www.cartoon.com is not accessible."
pattern = r'^(((([A-Za-z0-9]+){1,63}\.)|(([A-Za-z0-9]+(\-)+[A-Za-z0-9]+){1,63}\.))+){1,255}$'
tokeniser=RegexpTokenizer(pattern)

print (tokeniser.tokenize(line))

Output:

[]

What am I doing wrong? Any better regex for domain names?

Edit: Special characters must remain as separate tokens; in the example above, tokenization must separate 'website' and ':'.
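For reference, the empty result can be reproduced with the `re` module alone, which hints at the problem: `RegexpTokenizer` collects all matches of the pattern *inside* the line, but the `^` and `$` anchors force the pattern to match the entire string, which a full sentence never does. A minimal sketch:

```python
import re

# The anchors ^ and $ require the whole string to look like a bare domain,
# so nothing inside a full sentence can ever match.
pattern = r'^(((([A-Za-z0-9]+){1,63}\.)|(([A-Za-z0-9]+(\-)+[A-Za-z0-9]+){1,63}\.))+){1,255}$'
line = "My website: http://www.cartoon.com is not accessible."

print(re.findall(pattern, line))
```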


Solution

  • You may use

    tokeniser=RegexpTokenizer(r'\b(?:http|ftp)s?://\S*\w|\w+|[^\w\s]+')
    


    Details:

    • \b - leading word boundary (the match must be preceded by a non-word character or the start of the string)
    • (?:http|ftp)s?:// - a protocol part: http/https or ftp/ftps, followed by ://
    • \S* - zero or more non-whitespace characters
    • \w - a word character (letter, digit, or underscore), so the URL ends in a word character
    • | - or
    • \w+ - one or more word characters (an ordinary word)
    • | - or
    • [^\w\s]+ - one or more characters that are neither word characters nor whitespace (runs of punctuation)
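Since `RegexpTokenizer` with the default `gaps=False` simply returns all matches of the pattern, its behavior on the question's sentence can be checked with plain `re.findall` (a minimal sketch; swap in `RegexpTokenizer(pattern).tokenize(line)` for the same result):

```python
import re

# RegexpTokenizer(pattern) with gaps=False returns the pattern's matches,
# so re.findall reproduces the tokenization directly.
pattern = r'\b(?:http|ftp)s?://\S*\w|\w+|[^\w\s]+'
line = "My website: http://www.cartoon.com is not accessible."

tokens = re.findall(pattern, line)
print(tokens)
# ['My', 'website', ':', 'http://www.cartoon.com', 'is', 'not', 'accessible', '.']
```

The URL alternative is listed first so it wins over the plain `\w+` branch, and `[^\w\s]+` picks up the colon and the final period as their own tokens, which matches the edit's requirement.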