I use pywhois to determine if a domain name is registered or not. Here is my source code. (all permutations from a.net
to zzz.net
)
#!/usr/bin/env python
import whois #pip install python-whois
import string
import itertools
def main():
characters = list(string.ascii_lowercase)
##domain names generator
for r in range(1, 4) :
for name in itertools.permutations(characters, r) : #from 'a.net' to 'zzz.net'
url = ''.join(name) + '.net'
#check if a domain name is registered or not
try :
w = whois.whois(url)
except (whois.parser.PywhoisError): #NOT FOUND
print(url) #unregistered domain names?
if __name__ == '__main__':
main()
I got the following results:
jv.net
uli.net
vno.net
xni.net
However, all above domain names have already been registered. It is not accurate. Can anyone explain it? There are a lot of errors:
fgets: Connection reset by peer
connect: No route to host
connect: Network is unreachable
connect: Connection refused
Timeout.
There is an alternative way, reported here.
import socket
try:
socket.gethostbyname_ex(url)
except:
print(url) #unregistered domain names?
In speaking of speed, I use map
to parallel processing.
def select_unregisteredd_domain_names(self, domain_names):
#Parallelism using map
pool = ThreadPool(16) # Sets the pool size
results = pool.map(query_method(), domain_names)
pool.close() #close the pool and wait for the work to finish
pool.join()
return results
This is a tricky problem to solve, trickier than most people realize. The reason for that is that some people don't want you to find that out. Most domain registrars apply lots of black magic (i.e. lots of TLD-specific hacks) to get the nice listings they provide, and often they get it wrong. Of course, in the end they will know for sure, since they have EPP access that will hold the authoritative answer (but it's usually done only when you click "order").
Your first method (whois) used to be a good one, and I did this on a large scale back in the 90s when everything was more open. Nowadays, many TLDs protect this information behind captchas and obstructive web interfaces, and whatnot. If nothing else, there will be quotas on the number of queries per IP. (And it may be for good reason too, I used to get ridiculous amounts of spam to email addresses used for registering domains). Also note that spamming their WHOIS databases with queries is usually in breach of their terms of use and you might get rate limited, blocked, or even get an abuse report to your ISP.
Your second method (DNS) is usually a lot quicker (but don't use gethostbyname, use Twisted or some other async DNS for efficiency). You need to figure out how the response for taken and free domains look like for each TLD. Just because a domain doesn't resolve doesn't mean its free (it could just be unused). And conversely, some TLDs have landing pages for all nonexisting domains. In some cases it will be impossible to determine using DNS alone.
So, how do you solve it? Not with ease, I'm afraid. For each TLD, you need to figure out how to make clever use of DNS and whois databases, starting with DNS and resorting to other means in the tricky cases. Make sure not to flood whois databases with queries.
Another option is to get API access to one of the registrars, they might offer programmatic access to domain search.