I have log lines that look like this: ['GET http://10.0.0.0:1000/ HTTP/X.X']
I want to get each one in this form:
['GET', 'http://10.0.0.0:1000/', 'HTTP/X.X']
but that's not what I get. I've used this code:
import re
import nltk
sentences = ['GET http://10.0.0.0:1000/ HTTP/X.X']
rx = re.compile(r'\b(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)')
words=[]
for sent in sentences:
    m = rx.search(sent)
    if m:
        words.append(list(m.groups()))
    else:
        words.append(nltk.word_tokenize(sent))
print(words)
The output I get is:
[['GET', 'http', ':', '//10.0.0.0:1000/', 'HTTP/X.X']]
Does anyone know where the error is, or why it's not working the way I want? Thank you.
sentences = ['GET http://10.0.0.0:1000/ HTTP/X.X']
words = []
for sent in sentences:
    # str.split already returns a list, so no list() wrapper is needed
    words.append(sent.split(' '))
print(words)
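Can you use a simple space split? I think nltk.word_tokenize is giving you the wrong output! Your regex expects a full access-log line (date, time, host, IP and so on), so it never matches a bare request line like this one. The code then falls into the else branch, and nltk.word_tokenize splits the URL at the colon.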
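If you'd still like a regex for lines that contain only the request part, here is a minimal sketch. The pattern and the helper name split_request_line are my own assumptions, not part of your original code. It assumes each line is just METHOD, URL and PROTOCOL separated by whitespace, and it falls back to a plain split otherwise:

```python
import re

# Hypothetical pattern for a bare request line: METHOD URL PROTOCOL.
# This is an assumption about the input, not the original access-log regex.
REQUEST_LINE = re.compile(r'([A-Z]+)\s+(\S+)\s+(\S+)')

def split_request_line(sent):
    """Return [method, url, protocol], or a whitespace split if no match."""
    m = REQUEST_LINE.fullmatch(sent.strip())
    if m:
        return list(m.groups())
    # Fallback: split on any run of whitespace instead of tokenizing with nltk
    return sent.split()

sentences = ['GET http://10.0.0.0:1000/ HTTP/X.X']
words = [split_request_line(sent) for sent in sentences]
print(words)
```

Since \S+ matches any run of non-whitespace characters, the URL stays in one piece, colon included.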