python selenium web-scraping beautifulsoup web-search

Automate web search without API


I'm trying to automatically pull information from this website (http://ports.com/sea-route/) for a set of values. I have a list of start and destination ports, e.g. THEODOSIA and KERCH, and I need to extract the calculated distance, speed and days for each start-destination combination. Can someone please advise on how to achieve this in Python?

Another potential hurdle is that the ports in my list have 'short names', e.g. THEODOSIA, which stands for Port of Theodosia, Ukraine. When you enter THEODOSIA in the search box, the website offers an auto-complete suggestion, so that's fine for a manual search. However, I'm not sure how that works in an automated search.

I'm completely inexperienced in web scraping/searching, so I started writing the code below after reading a few online articles, but I have reached a dead end and don't think my code is of any use.

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
from webdriver_manager.chrome import ChromeDriverManager
import requests

#Example start and destination port values
df = pd.DataFrame({'StartPort':['THEODOSIA', 'ROSTOV'], 'DestinationPort':['KERCH', 'MARSEILLE']})

r = requests.get('http://ports.com/sea-route/')
soup = BeautifulSoup(r.content, 'html.parser')
rows = soup.findAll('tr', {"class": "span-7 prepend-top"})

startport = []
for a in soup.findAll('a',href=True, attrs={'class':"span-7 prepend-top"}):
    startport=a.find('div', attrs={'class':"span-7 title ac_input"})

Solution

  • You can use the site's own AJAX endpoints (the ones under /aj/ in the code below) to get full port names, then use these names to obtain the distance, speed and days at sea.

    For example:

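    import requests


    from_ = 'Theodosia'
    to_ = 'Kerch'

    find_port_url = 'http://ports.com/aj/findport/'
    route_url = 'http://ports.com/aj/sea-route/'

    def find_port(port_name):
        # Ask the autocomplete endpoint for a single suggestion. The response
        # is '|'-separated text; the first field is taken as the full port name.
        r = requests.get(find_port_url, params={'q': port_name, 'limit': 1})
        return r.text.split('|')[0]

    def find_route(f, t):
        # Only the part of each name before the first comma is sent ('c' and 'd').
        params = {'a': 0, 'b': 0, 'c': f.split(',')[0], 'd': t.split(',')[0]}
        # The X-Requested-With header marks the request as an AJAX call.
        headers = {'X-Requested-With': 'XMLHttpRequest'}
        data = requests.get(route_url, params=params, headers=headers).json()
        return data['cost']['nauticalmiles'], data['default_speed'], data['days_at_sea']


    # Resolve the short names to full port names, then query the route.
    f = find_port(from_)
    t = find_port(to_)

    nm, speed, days = find_route(f, t)
    print('Distance: {} nm Speed: {} Days at sea: {:.1f}'.format(nm, speed, days))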
    

    Prints:

    Distance: 70 nm Speed: 10 Days at sea: 0.3
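
Since the question asks about a whole list of start-destination pairs, here is a minimal sketch of how the two functions above could be applied in a loop. The helper name collect_routes, the result keys, and the idea of caching one lookup per short name are my own additions for illustration, not anything the site requires. It also assumes that find_port returns an empty string when the autocomplete has no suggestion, which I have not verified against the site.

```python
def collect_routes(pairs, find_port, find_route):
    """Look up every (start, destination) pair and return one dict per pair.

    find_port and find_route are passed in as arguments so the loop can reuse
    the two functions from the answer above (or stand-ins for offline testing).
    collect_routes itself is a hypothetical helper, not part of any library.
    """
    full_names = {}  # cache: short name -> full port name (one request per port)
    results = []
    for start, dest in pairs:
        for short in (start, dest):
            if short not in full_names:
                full_names[short] = find_port(short)
        f, t = full_names[start], full_names[dest]

        row = {'StartPort': start, 'DestinationPort': dest,
               'Distance_nm': None, 'Speed': None, 'DaysAtSea': None}
        # Assumption: an empty name means the autocomplete found no match,
        # so the route lookup is skipped and the row keeps its None values.
        if f and t:
            nm, speed, days = find_route(f, t)
            row.update(Distance_nm=nm, Speed=speed, DaysAtSea=days)
        results.append(row)
    return results
```

With the DataFrame from the question, the pairs could be built with zip(df['StartPort'], df['DestinationPort']), and the returned list of dicts can be passed straight to pd.DataFrame(...) to get one row per combination. Passing the lookup functions in keeps the loop independent of the network, so it can be tried out with dummy functions before sending real requests.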