python tokenize

Tokenize the URL


I have sentences (log lines) that look like this: ['GET http://10.0.0.0:1000/ HTTP/X.X']. I want them in this form:

['GET', 'http://10.0.0.0:1000/', 'HTTP/X.X'] 

but that is not what I get. I used this code:

import re
import nltk

sentences = ['GET http://10.0.0.0:1000/ HTTP/X.X']
rx = re.compile(r'\b(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)')

words = []
for sent in sentences:
    m = rx.search(sent)
    if m:
        words.append(list(m.groups()))
    else:
        # fall back to NLTK's tokenizer when the regex does not match
        words.append(nltk.word_tokenize(sent))

print(words)

I get this output:

[['GET', 'http', ':', '//10.0.0.0:1000/', 'HTTP/X.X']]

Does anyone know where the error is, or why it is not working as I want? Thank you.


Solution

  • sentences = ['GET http://10.0.0.0:1000/ HTTP/X.X']
    
    words = []
    for sent in sentences:
        # split on single spaces; the URL stays in one piece
        words.append(sent.split(' '))
    
    print(words)

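    Can you use a simple space split? Your regex does not match this line, so the code falls back to nltk.word_tokenize, and that is what splits the URL at the colon and gives you the wrong output.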
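
If you would rather keep a regex in the loop, here is a minimal sketch of one way to do it, assuming every log entry is a bare request line of the form "METHOD URL PROTOCOL". The names REQUEST_LINE and tokenize_request are made up for this example and are not part of the original code. Lines that do not fit the pattern fall back to a plain whitespace split instead of NLTK.

```python
import re

# Hypothetical pattern for a bare request line: three whitespace-separated fields.
REQUEST_LINE = re.compile(r'^(\S+)\s+(\S+)\s+(\S+)$')

def tokenize_request(line):
    """Return [method, url, protocol], or a whitespace split if the line has another shape."""
    m = REQUEST_LINE.match(line.strip())
    if m:
        return list(m.groups())
    # Fallback: plain whitespace split, which never breaks a URL apart.
    return line.split()

sentences = ['GET http://10.0.0.0:1000/ HTTP/X.X']
words = [tokenize_request(sent) for sent in sentences]
print(words)  # [['GET', 'http://10.0.0.0:1000/', 'HTTP/X.X']]
```

The long pattern in the question expects a full access-log entry (date, host, IP, quoted request, status code and so on), so it can never match a line that contains only the request itself.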