I have a few million IPv4 addresses in a .txt file, like so:
x.y.z.w
x.y.z.w
x.y.z.w
...
My goal is to check, for each address, whether there's a real website behind it or the address is fake.
I've only seen posts dealing with URLs (not addresses), and indeed I tried the methods described there: reverse-DNS each IP address to a URL first, then use that to determine whether the website exists. However, it takes about 2 seconds per address, which means a few months for all of them, and of course I don't have that time.
What's the best, fastest way to do it?
I highly prefer Python, but could using C speed things up significantly?
Thanks.
Unless these websites are virtually hosted, IP addresses are no different from hostnames. But with virtual hosting, reverse DNS won't help you: many sites can share the same IP address, and the one your lookup happens to return might be down at the moment. Also, not every website is registered in the reverse DNS records, so you would miss some entirely.
I don't know what method you are using to determine whether a website is hosted at an address, but whatever it is, it is almost certainly IO bound, not CPU bound. That means rewriting it in C will probably yield an insignificant performance improvement, because the program spends most of its time waiting for responses from the network anyway.
What you can do to improve performance is:
- Decrease timeouts. If you are using the default timeouts for network operations, you might find yourself waiting for responses much longer than you need to; a dead address usually just never answers.
- Parallelize tasks. Try the threading or asyncio modules. They are built to run many tasks concurrently, and asyncio is specifically designed for IO bound programs. A sketch combining both ideas follows this list.
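For illustration, here is a minimal asyncio sketch that combines both points: it attempts a plain TCP connection to port 80 with a short timeout, running many attempts concurrently. The file name, port, timeout, and concurrency limit are my own assumed values, not anything from your setup:

    import asyncio

    # Illustrative assumptions, adjust to taste.
    INPUT_FILE = "addresses.txt"
    PORT = 80            # 443 would work the same way for HTTPS
    TIMEOUT = 1.0        # dead addresses usually never answer, so keep it short
    CONCURRENCY = 500    # simultaneous connection attempts

    async def has_listener(ip: str, sem: asyncio.Semaphore) -> bool:
        # True if something accepts a TCP connection on PORT within TIMEOUT.
        async with sem:
            try:
                _, writer = await asyncio.wait_for(
                    asyncio.open_connection(ip, PORT), timeout=TIMEOUT
                )
                writer.close()
                await writer.wait_closed()
                return True
            except (OSError, asyncio.TimeoutError):
                return False

    async def main() -> None:
        sem = asyncio.Semaphore(CONCURRENCY)
        with open(INPUT_FILE) as f:
            ips = [line.strip() for line in f if line.strip()]
        # gather() returns results in the same order as the input
        results = await asyncio.gather(*(has_listener(ip, sem) for ip in ips))
        for ip, alive in zip(ips, results):
            print(ip, "up" if alive else "down")

    asyncio.run(main())

Note that a successful TCP connect only proves something is listening on the port; to confirm there's an actual website behind it, you'd still need to send a minimal HTTP request (a HEAD request, for example). Also, for millions of addresses you'd want to feed asyncio.gather in chunks of, say, a few tens of thousands at a time, to keep memory use bounded.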
Also, consider using tools that already have these features implemented, such as nmap.
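For instance, an invocation along these lines (assuming your file is named addresses.txt) would check ports 80 and 443 for every address in the list:

    nmap -n -Pn -p 80,443 --open -iL addresses.txt -oG results.txt

Here -n skips DNS resolution, -Pn skips the ping-based host discovery (many web servers don't answer pings), and --open reports only hosts that actually have one of the ports open.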