How to extract an IP address in web scanning


When I run the IP address extraction on its own as a simple task, the program works well. But inside the complete web-crawling program it fails and gives inconsistent results.

This is my code snippet for IP address extraction:

    #!/usr/bin/python3

    import os
    import re 

    def get_ip_address(url):
        command = "host " + url
        process = os.popen(command)
        results = str(process.read())
        marker = results.find("has address") + 12
        n = (results[marker:].splitlines()[0])
        m = re.search('\w+ \w+: \d\([A-Z]+\)', n)
        if m is not None:
            url_new = url[8:]
            command = "host " + url_new
            process = os.popen(command)
            results = str(process.read())
            marker = results.find("has address") + 12
            return results[marker:].splitlines()[0]

    print(get_ip_address("https://www.yahoo.com"))

The complete program for web crawling looks like this:

    #!/usr/bin/python3

    from general import *
    from domain_name import *
    from ip_address import *
    from nmap import * 
    from robots_txt import *
    from whois import *

    ROOT_DIR = "companies"
    create_dir(ROOT_DIR)

    def gather_info(name, url):
        domain_name = get_domain_name(url)
        ip_address = get_ip_address(url)
        nmap = get_nmap('-F', ip_address)
        robots_txt = get_robots_txt(url)
        whois = get_whois(domain_name)
        create_report(name, url, domain_name, nmap, robots_txt, whois, ip_address)

    def create_report(name, full_url, domain_name, nmap, robots_txt, whois, ip_address):
        project_dir = ROOT_DIR + '/' + name
        create_dir(project_dir)
        write_file(project_dir + '/full_url.txt', full_url)
        write_file(project_dir + '/domain_name.txt', domain_name)
        write_file(project_dir + '/nmap.txt', nmap)
        write_file(project_dir + '/robots_txt.txt', robots_txt)
        write_file(project_dir + '/whois.txt', whois)
        write_file(project_dir + '/ip_address.txt', ip_address)

    x = input("Enter the Company Name: ")
    y = input("Enter the complete url of the company: ")    
    gather_info( x , y )

The terminal session, including the input I entered, looks like this:

    root@nitin-Lenovo-G580:~/Desktop/web_scanning# python3 main.py 
    106.10.138.240
    Enter the Company Name: Yahoo
    Enter the complete url of the company: https://www.yahoo.com/
    /bin/sh: 1: Syntax error: "(" unexpected

And the output in ip_address.txt is:

    hoo.com/ not found: 3(NXDOMAIN)

As shown, the program prints the IP 106.10.138.240 when it runs, yet something different is saved in ip_address.txt. I also could not figure out where the /bin/sh syntax error comes from. Please help.


Solution

  • I second Joe Lin's suggestion to not use wildcards in your import statements. It pollutes your namespace greatly and may yield bizarre behavior.
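
    To make the namespace point concrete, here is a small sketch that uses the standard library's os module as a stand-in (the question's own modules are not shown, so they are not used here). It counts the names a wildcard import would copy in and shows that one of them is open, which would replace the built-in open in the importing module.

```python
import os

# Names that `from os import *` would copy into the importing module
exported = set(os.__all__)
print(len(exported), "names would be imported")

# os.open is a low-level call that returns a file descriptor; after a
# wildcard import it would shadow the built-in open() in that module.
print("open" in exported)

# Explicit alternative: import the module and qualify each use,
# so the origin of every name stays visible at the call site.
project_dir = os.path.join("companies", "Yahoo")
print(project_dir)
```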

    Python is "batteries included" and has a large package ecosystem, so you should probably leverage the requests and urllib3 packages for HTTP requests, use subprocess cautiously for executing commands, and check out the scrapy package for web scraping. The objects and methods they provide may already return what you are trying to extract.

    Be as lazy as possible and rely on "prior art."
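
    As one way to "use subprocess cautiously", the sketch below passes the command as a list of arguments, so no shell ever parses it. The helper name run_tool is hypothetical, and the Python interpreter itself stands in for an external tool such as host so that the example runs anywhere. The argument deliberately contains a "(", the character /bin/sh complained about in the question.

```python
import subprocess
import sys

def run_tool(argv):
    """Run a command given as a list of arguments and return its stdout."""
    # No shell=True: each list item reaches the program as one argument,
    # so spaces and characters like "(" are never parsed by /bin/sh.
    completed = subprocess.run(argv, capture_output=True, text=True, check=False)
    return completed.stdout

# A string that would break a command built by concatenation and run via a shell
awkward = "hoo.com/ not found: 3(NXDOMAIN)"

# The child process simply echoes back its first argument
echoed = run_tool([sys.executable, "-c", "import sys; print(sys.argv[1])", awkward])
print(echoed.strip())
```

    With host installed, the equivalent call would look like run_tool(["host", hostname]).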

    In the first few lines of get_ip_address I notice the following:

    def get_ip_address(url):
        command = "host " + url
        process = os.popen(command)
        ....
    

    If I executed this command via a shell, it would literally mirror this:

    host http://www.foo.com
    

    Doing a man host and reading the man page:

       host is a simple utility for performing DNS lookups. It is normally
       used to convert names to IP addresses and vice versa. When no arguments
       or options are given, host prints a short summary of its command line
       arguments and options.
    
       name is the domain name that is to be looked up. It can also be a
       dotted-decimal IPv4 address or a colon-delimited IPv6 address, in which
       case host will by default perform a reverse lookup for that address.
       server is an optional argument which is either the name or IP address
       of the name server that host should query instead of the server or
       servers listed in /etc/resolv.conf.
    

    You are giving host a URL, when it only wants an IP address or a hostname. URLs include the scheme, hostname, and path. You will have to extract the hostname explicitly to make host work the way you have chosen to interact with it. Since a URL may or may not include a path, you have to take it apart:

    url = "http://www.yahoo.com/some_random/path"

    # Split on "//" once to drop the scheme
    _, host_and_path = url.split("//", 1)

    # partition() splits at the first "/" and still works when there is no path
    hostname, _, path = host_and_path.partition("/")

    # Use 'hostname' (not 'url') as input to the command
    command = "host " + hostname
    ...
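
    The same unraveling can be sketched with the standard library alone, which also avoids calling host at all. This is an alternative, not the question's method: hostname_from_url and resolve_ip are hypothetical helper names, urlsplit does the parsing, and socket.gethostbyname returns a single IPv4 address.

```python
import socket
from urllib.parse import urlsplit

def hostname_from_url(url):
    """Return the hostname of a URL; bare hostnames are accepted too."""
    # urlsplit only fills in the network location when "//" is present,
    # so prepend it for inputs such as "www.yahoo.com".
    parts = urlsplit(url if "//" in url else "//" + url)
    return parts.hostname

def resolve_ip(url):
    """Resolve the URL's hostname to one IPv4 address (performs a DNS lookup)."""
    return socket.gethostbyname(hostname_from_url(url))

print(hostname_from_url("https://www.yahoo.com/"))
print(hostname_from_url("http://www.yahoo.com/some_random/path"))
# resolve_ip("https://www.yahoo.com/") would return an address string,
# but it needs network access, so it is not called here.
```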
    

    I do not believe the question includes all of the code related to this problem. The error output comes from the shell, not from a Python stack trace, so one of the get_something functions is probably using Popen to run a shell command.
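
    One possible reading of the contents of ip_address.txt, offered as a sketch: assume the second host call printed a line of the form "Host www.yahoo.com/ not found: 3(NXDOMAIN)" (reconstructed here, not captured from a real run). Since str.find returns -1 when "has address" is absent, the question's marker becomes 11 and the slice starts in the middle of the hostname. If that fragment is later concatenated into a shell command (for example inside get_nmap, whose code is not shown), the unquoted "(" could explain the /bin/sh error. The helper extract_address is a hypothetical guarded version.

```python
# Assumed `host` output for a failed lookup (reconstructed, not a real capture)
failed = "Host www.yahoo.com/ not found: 3(NXDOMAIN)\n"

# The question's slicing logic: find() returns -1 here, so marker is 11
marker = failed.find("has address") + 12
fragment = failed[marker:].splitlines()[0]
print(fragment)

def extract_address(results):
    """Return the text after the first 'has address ', or None if absent."""
    key = "has address "
    index = results.find(key)
    if index == -1:
        # Lookup failed: report that instead of slicing at a bogus offset
        return None
    return results[index + len(key):].splitlines()[0]

print(extract_address(failed))
print(extract_address("www.yahoo.com has address 106.10.138.240\n"))
```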