What I want is to keep domain names as single tokens. For example, "https://www.twitter.com" should remain a single token.
My code:
import nltk
from nltk.tokenize.regexp import RegexpTokenizer
line="My website: http://www.cartoon.com is not accessible."
pattern = r'^(((([A-Za-z0-9]+){1,63}\.)|(([A-Za-z0-9]+(\-)+[A-Za-z0-9]+){1,63}\.))+){1,255}$'
tokeniser=RegexpTokenizer(pattern)
print (tokeniser.tokenize(line))
Output:
[]
What am I doing wrong? Any better regex for domain names?
Edit: Special characters must remain separate tokens; in the example above, tokenization must separate 'website' and ':'.
Your pattern is anchored with ^ and $, so it can only match when the entire line consists of domain names, which is why you get an empty list. It also contains capturing groups, which RegexpTokenizer does not support (it relies on re.findall; use non-capturing (?:...) groups instead). You may use
tokeniser=RegexpTokenizer(r'\b(?:http|ftp)s?://\S*\w|\w+|[^\w\s]+')
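For instance, run against the sample line from the question (a minimal sketch; the output shown in the comment is what this pattern yields):

from nltk.tokenize.regexp import RegexpTokenizer

line = "My website: http://www.cartoon.com is not accessible."
tokeniser = RegexpTokenizer(r'\b(?:http|ftp)s?://\S*\w|\w+|[^\w\s]+')

# The URL survives as one token, and ':' / '.' become separate tokens:
# ['My', 'website', ':', 'http://www.cartoon.com', 'is', 'not', 'accessible', '.']
print(tokeniser.tokenize(line))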
Details:

\b - leading word boundary (there must be a non-word char before)
(?:http|ftp)s?:// - a protocol: http/https or ftp/ftps
\S*\w - zero or more non-whitespace symbols ending with a word char (letter, digit, or _)
| - or
\w+ - one or more word chars
| - or
[^\w\s]+ - one or more non-word chars, excluding whitespace.
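To illustrate the branches once more (a sketch; the sample sentence and its URLs are invented for this example): the first alternative also catches ftp/ftps URLs and drops trailing punctuation from them, because \S*\w must end on a word char; the last alternative then collects that punctuation as its own token.

from nltk.tokenize.regexp import RegexpTokenizer

tokeniser = RegexpTokenizer(r'\b(?:http|ftp)s?://\S*\w|\w+|[^\w\s]+')

# ';' and '!' are not part of the URLs, since \S*\w must end on a word char:
# ['Mirror', 'at', 'ftps://files.example.org/pub', ';', 'see', 'https://example.com', '!']
print(tokeniser.tokenize("Mirror at ftps://files.example.org/pub; see https://example.com!"))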