My IP-address extraction works fine when I run it on its own, but inside my complete web-crawling program it fails and gives inconsistent results.
This is my code snippet for the IP address:
#!/usr/bin/python3
import os
import re

def get_ip_address(url):
    command = "host " + url
    process = os.popen(command)
    results = str(process.read())
    marker = results.find("has address") + 12
    n = (results[marker:].splitlines()[0])
    m = re.search('\w+ \w+: \d\([A-Z]+\)', n)
    if m is not None:
        url_new = url[8:]
        command = "host " + url_new
        process = os.popen(command)
        results = str(process.read())
        marker = results.find("has address") + 12
    return results[marker:].splitlines()[0]

print(get_ip_address("https://www.yahoo.com"))
The complete program for web crawling looks like this:
#!/usr/bin/python3
from general import *
from domain_name import *
from ip_address import *
from nmap import *
from robots_txt import *
from whois import *

ROOT_DIR = "companies"
create_dir(ROOT_DIR)

def gather_info(name, url):
    domain_name = get_domain_name(url)
    ip_address = get_ip_address(url)
    nmap = get_nmap('-F', ip_address)
    robots_txt = get_robots_txt(url)
    whois = get_whois(domain_name)
    create_report(name, url, domain_name, nmap, robots_txt, whois, ip_address)

def create_report(name, full_url, domain_name, nmap, robots_txt, whois, ip_address):
    project_dir = ROOT_DIR + '/' + name
    create_dir(project_dir)
    write_file(project_dir + '/full_url.txt', full_url)
    write_file(project_dir + '/domain_name.txt', domain_name)
    write_file(project_dir + '/nmap.txt', nmap)
    write_file(project_dir + '/robots_txt.txt', robots_txt)
    write_file(project_dir + '/whois.txt', whois)
    write_file(project_dir + '/ip_address.txt', ip_address)

x = input("Enter the Company Name: ")
y = input("Enter the complete url of the company: ")
gather_info(x, y)
The input entered looks like this:
root@nitin-Lenovo-G580:~/Desktop/web_scanning# python3 main.py
106.10.138.240
Enter the Company Name: Yahoo
Enter the complete url of the company: https://www.yahoo.com/
/bin/sh: 1: Syntax error: "(" unexpected
And the output in ip_address.txt is:
hoo.com/ not found: 3(NXDOMAIN)
As you can see, the program prints the IP 106.10.138.240 at runtime, yet it saves something different in ip_address.txt. I also cannot figure out where the /bin/sh syntax error comes from. Please help.
I second Joe Lin's suggestion to not use wildcards in your import statements. It pollutes your namespace greatly and may yield bizarre behavior.
Python is "batteries included", so you should probably leverage the third-party requests and urllib3 packages for HTTP requests, use subprocess cautiously for executing commands, and check out the scrapy package for web scraping. The objects and methods they provide may already return the data you are attempting to extract.
Be as lazy as possible and rely on "prior art."
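As an illustration of using subprocess cautiously, here is a minimal sketch. The helper name run_command is hypothetical and not part of the question's code. It passes the command as a list of arguments, so no shell ever parses them, and the Python interpreter itself stands in for the external tool so the example runs anywhere:

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments and return its stdout."""
    # With a list and the default shell=False, no shell parses the arguments,
    # so characters such as "(" or spaces are passed through literally.
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.stdout


# Demonstration: the child process simply prints back its first argument.
text = "not found: 3(NXDOMAIN)"
echoed = run_command([sys.executable, "-c", "import sys; print(sys.argv[1])", text])
print(echoed.strip())
```

A call such as run_command(["host", hostname]) would follow the same pattern, assuming the host utility is installed.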
In the first few lines of get_ip_address I notice the following:
def get_ip_address(url):
    command = "host " + url
    process = os.popen(command)
    ....
If I executed this command via a shell, it would literally mirror this:
host http://www.foo.com
Doing a man host and reading the man page:
host is a simple utility for performing DNS lookups. It is normally
used to convert names to IP addresses and vice versa. When no arguments
or options are given, host prints a short summary of its command line
arguments and options.
name is the domain name that is to be looked up. It can also be a
dotted-decimal IPv4 address or a colon-delimited IPv6 address, in which
case host will by default perform a reverse lookup for that address.
server is an optional argument which is either the name or IP address
of the name server that host should query instead of the server or
servers listed in /etc/resolv.conf.
You are providing host a URL, when it only wants either an IP address or a hostname. URLs include the scheme, hostname, and path. You will have to extract the hostname explicitly to make host work the way you have chosen to interact with it. Given that URLs may or may not include a path, you have to unravel it:
url = "http://www.yahoo.com/some_random/path"
# Split on "//" to drop the scheme
_, host_and_path = url.split("//", 1)
# partition() splits at the first "/" and still works when there is no path
hostname, _, path = host_and_path.partition("/")
# Use 'hostname' (not 'url') as input to the command
command = "host " + hostname
...
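The standard library can also do this parsing and the lookup itself. The following is a minimal sketch, not a drop-in replacement: the helper names extract_hostname and resolve_ip are hypothetical, and it uses urllib.parse.urlsplit and socket.gethostbyname instead of shelling out to host:

```python
import socket
from urllib.parse import urlsplit


def extract_hostname(url):
    """Return only the hostname part of a URL (hypothetical helper)."""
    # urlsplit only recognises the host when a scheme or a leading "//" is present
    if "//" not in url:
        url = "//" + url
    hostname = urlsplit(url).hostname
    if hostname is None:
        raise ValueError("no hostname found in " + repr(url))
    return hostname


def resolve_ip(url):
    """Resolve the URL's hostname to an IPv4 address without calling a shell."""
    # gethostbyname performs the DNS lookup; it raises socket.gaierror on failure
    return socket.gethostbyname(extract_hostname(url))
```

Calling resolve_ip("https://www.yahoo.com/") would then return a dotted IPv4 string regardless of the trailing slash, provided DNS resolution succeeds.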
I do not believe the question provides all of the code related to this problem. The error output appears to come from the shell rather than being a traditional Python stack trace, so maybe one of the get_something functions uses Popen to run a shell command.
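One way to test that theory is to replay the string slicing from get_ip_address by hand. The sketch below assumes that host printed "Host www.yahoo.com/ not found: 3(NXDOMAIN)" for the second lookup; that line is inferred from the contents of ip_address.txt and is not shown in the question. The successful line is likewise an assumed sample, and slice_after_marker is a hypothetical name for the question's slicing logic:

```python
def slice_after_marker(results):
    """Replay the slicing logic used in the question's get_ip_address."""
    # str.find returns -1 when "has address" is absent, so marker becomes 11
    marker = results.find("has address") + 12
    return results[marker:].splitlines()[0]


# Assumed output of `host www.yahoo.com/` (trailing slash left in the name)
failed = "Host www.yahoo.com/ not found: 3(NXDOMAIN)\n"
print(slice_after_marker(failed))

# Assumed output of a successful lookup
ok = "www.yahoo.com has address 106.10.138.240\n"
print(slice_after_marker(ok))
```

The first call yields exactly the text saved in ip_address.txt. If get_nmap (not shown in the question) concatenates that value into a shell command string, the "(" in it would plausibly trigger the /bin/sh syntax error.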