I want to keep domain names as single tokens. For example, "https://www.twitter.com" should remain a single token.
My code:
import nltk
from nltk.tokenize.regexp import RegexpTokenizer
line="My website: http://www.cartoon.com is not accessible."
pattern = r'^(((([A-Za-z0-9]+){1,63}\.)|(([A-Za-z0-9]+(\-)+[A-Za-z0-9]+){1,63}\.))+){1,255}$'
tokeniser = RegexpTokenizer(pattern)
print(tokeniser.tokenize(line))
What am I doing wrong? Any better regex for domain names?
Edit: Special characters must remain separate tokens. In the example above, tokenization must split 'website' and ':' into two tokens.
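You may use the following pattern, which tries three alternatives from left to right (a URL, then a run of word characters, then a run of punctuation):
\b(?:http|ftp)s?://\S*\w|\w+|[^\w\s]+
Its parts: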
- \b - leading word boundary (there must be a non-word char before...)
- (?:http|ftp)s?:// - a protocol, http/https or ftp/ftps
- \S* - 0+ non-whitespace symbols
- \w - a word char (=letter/digit/_)
- | - or
- \w+ - 1 or more word chars
- | - or
- [^\w\s]+ - 1 or more non-word chars excluding whitespace.
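To check how the pieces listed above behave together, here is a minimal sketch that joins them into one pattern and runs it on the sentence from the question. It uses the standard `re` module so it runs without extra installs; the `tokenize` helper name is only for illustration. Since the pattern has no capturing groups, passing the same string to NLTK's `RegexpTokenizer` should yield the same tokens, though that is an assumption worth verifying on your NLTK version.

```python
import re

# Pattern assembled from the parts described above.
# Alternatives are tried left to right, so a URL wins over plain words.
PATTERN = r'\b(?:http|ftp)s?://\S*\w|\w+|[^\w\s]+'


def tokenize(text):
    """Return URLs, words, and punctuation runs as separate tokens.

    Hypothetical helper for illustration; re.findall returns whole
    matches here because the only group is non-capturing.
    """
    return re.findall(PATTERN, text)


line = "My website: http://www.cartoon.com is not accessible."
tokens = tokenize(line)
print(tokens)
```

The trailing `\w` in the URL branch is what keeps sentence punctuation out of the URL: `\S*` first grabs everything up to the next space, then backtracks until the last character is a word character, so a period right after a URL becomes its own token.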