I'm building a web server with Python and Tornado. The server provides a search service covering all the restaurants in a country. The logic is quite simple: the user types a keyword and submits it on the web page, and the server replies with some results. In short, it is like a mini Google.
I also keep a simple log.
In the log, I can see that most requests look like this:
[I 170625 19:23:12 web:2063] 200 GET /images/icon-language.png (116.31.83.132) 0.88ms
[I 170625 19:23:12 web:2063] 200 GET /index?type=Sight&key=Bol%20content (116.31.83.132) 10.05ms
[I 170625 19:30:30 web:2063] 304 GET / (116.31.83.132) 0.87ms
[I 170625 19:30:44 web:2063] 200 GET / (116.31.83.132) 0.78ms
[W 170625 19:30:51 web:2063] 405 POST / (116.31.83.132) 1.20ms
[W 170625 19:31:00 web:2063] 405 POST / (116.31.83.132) 0.63ms
[I 170625 19:31:22 web:2063] 200 POST /index (116.31.83.132) 0.89ms
[I 170625 19:31:42 web:2063] 200 GET /index (116.31.83.132) 0.62ms
[I 170625 19:31:49 web:2063] 200 GET / (116.31.83.132) 0.78ms
[W 170625 19:31:57 web:2063] 404 GET /abce (116.31.83.132) 0.65ms
But to my surprise, there are a few requests like the one below:
[W 170625 18:43:41 web:2063] 404 GET http://baidu.com/ (106.2.125.215) 0.60ms
I can't understand how this kind of request is generated.
For example, if the address of my web server is www.example.com and I send a GET request to it, the request must look like this: www.example.com/abcd. But this request doesn't start with /. How come?
Is this some kind of XSS (Cross-Site Scripting)? It seems that someone was trying to make a cross-origin request through my web server. If I'm right, I'm going to filter out all user keywords containing <script>. Am I right?
It seems to me that somebody confused your server with baidu.com. Either your server has some connection with them and the request reached you because of a badly configured DNS or something similar, or somebody simply mistyped the IP address for baidu.com and got your server instead.
I hope you know what HTTP requests look like, and that connecting to an IP address alone isn't enough for a production web server: you have to look at the "Host" HTTP header too. I don't know whether Tornado does this by default. If the Host header isn't your website's hostname, drop the connection and no such mix-ups occur.
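To make the Host check concrete, here is a minimal sketch in plain Python. The names ALLOWED_HOSTS and is_request_for_this_site are hypothetical, and www.example.com is just the placeholder hostname from the question. It also covers the case from the log where the request line carries a full URL instead of a path. That is the form a client normally sends to an HTTP proxy, so such requests may also come from a scanner probing for open proxies; that is a guess, not something the log proves.

```python
from urllib.parse import urlsplit

# Assumption: the hostnames this site is actually served under.
ALLOWED_HOSTS = {"www.example.com", "example.com"}


def is_request_for_this_site(request_uri, host_header, allowed_hosts=ALLOWED_HOSTS):
    """Return True only if the request names one of our own hostnames."""
    parts = urlsplit(request_uri)
    if parts.scheme and parts.netloc:
        # Full URL in the request line, e.g. "GET http://baidu.com/":
        # the host named in the URL is the one the client wanted.
        host = parts.hostname
    else:
        # Ordinary request such as "GET /index": use the Host header,
        # ignoring an optional ":port" suffix.
        host = urlsplit("//" + (host_header or "")).hostname
    return host is not None and host in allowed_hosts
```

In a Tornado handler, something like this could probably be called from prepare() with self.request.uri and self.request.host, raising tornado.web.HTTPError(400) when it returns False. Treat that wiring as an untested suggestion.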
And no, you are not right about XSS. <script> has nothing to do with the server side of the HTTP protocol and has no direct effect on it whatsoever. Don't mix up HTML and JS with HTTP. All they have in common is that HTML pages and JS scripts are what HTTP most commonly transfers.
Oh, and by the way, it would be wise to include the "User-Agent" HTTP header in your log. You can also find out, to some degree, who is reaching you by looking up the client IP with whois and similar services.
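As a rough sketch of the logging idea, the function below builds a line shaped like the ones in your log, with the User-Agent appended. The names format_access_line, log_request_with_user_agent and the logger name "access_with_ua" are made up for illustration. It only assumes the object it receives looks like a Tornado RequestHandler, that is, it has get_status() and a request with method, uri, remote_ip, headers and request_time().

```python
import logging

# Hypothetical logger name, chosen only for this example.
access_log = logging.getLogger("access_with_ua")


def format_access_line(handler):
    """Build one access-log line, including the client's User-Agent."""
    request = handler.request
    user_agent = request.headers.get("User-Agent", "-")
    return '%d %s %s (%s) %.2fms "%s"' % (
        handler.get_status(),
        request.method,
        request.uri,
        request.remote_ip,
        1000.0 * request.request_time(),  # seconds -> milliseconds
        user_agent,
    )


def log_request_with_user_agent(handler):
    """Log at INFO for status < 400, WARNING for 4xx, ERROR for 5xx."""
    status = handler.get_status()
    if status < 400:
        level = logging.INFO
    elif status < 500:
        level = logging.WARNING
    else:
        level = logging.ERROR
    access_log.log(level, format_access_line(handler))
```

Tornado accepts a log_function application setting, so passing log_function=log_request_with_user_agent to tornado.web.Application should replace the default per-request log line with this one.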