Tags: python, web-scraping, data-mining, information-extraction

How to extract an email address from a company name


I have an Excel file containing company names and addresses (around 70k companies). I want to extract the email address for each company using a web scraper in Python. For example, if we search for the company APPLE in Google, we can find its email address; in the same way, I want to find the email addresses of the companies listed in the file. Is there a library available, or any other way, to extract these email addresses?

For example, if I search Google for manzoor export, here is the result (screenshot of the search results page).

You can see that the email address appears right on the search results page; I want to extract it using Python.


Solution

  • Here are some quick guidelines for constructing a web scraping tool from scratch, using your case as an example:

    1. Building the request

    Postman is a useful tool for testing your request against an intended target and verifying that it works as intended. In my opinion, it offers a better environment than a web browser's network tab.

    In this case, I copy-pasted the search result URL for manzoor export into Postman, removed the unnecessary parameters and sent out a GET request. Upon confirming that it worked, I built the request in Requests syntax:

    from requests import Session

    # Reuse one session across queries; the initial HEAD request picks up Google's cookies.
    session = Session()
    session.head('https://www.google.com/')

    def google_search(input_string):
        # Requests URL-encodes the query parameter automatically.
        response = session.get(
            url='https://www.google.com/search',
            params={"q": input_string}
        )
        return response
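
    A quick sanity check of the function above; a 200 status code means Google answered with a results page:

    response = google_search("manzoor exports")
    print(response.status_code)  # expect 200 on success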
    
    2. Parsing the request output

    Beautiful Soup is a popular Python library which makes it very easy to parse HTML (I actually used it in my very first scraping tool). The reason I'm avoiding it here is that nowadays I prefer a bare-bones alternative which is also more efficient: lxml. Once you get familiar with its syntax, you'll appreciate how powerful it is.

    Another helpful tool is an HTML formatter like this one, which helps you locate attributes of interest much more quickly.

    from lxml import html
    import re

    def get_email(response):
        tree = html.fromstring(response.content)
        # All organic result snippets share this (obfuscated, brittle) class name.
        search_results = tree.xpath("//div[@class='BNeawe s3v9rd AP7Wnd']")
        for search_result in search_results:
            headings = search_result.xpath("./text()")
            for idx, heading in enumerate(headings):
                if heading == "\nEmail: ":
                    r = re.compile(".*@.*")
                    # XPath positions are 1-based, hence the offset.
                    text = search_result.xpath(f"./span[{idx + 1}]/text()")
                    matches = list(filter(r.match, text))
                    if matches:
                        return matches[0]
        return None
    

    P.S. You can substantially improve this function if you invest more time than I did.
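
    For instance, a more robust variant (my own sketch, not part of the original approach) is to fall back to scanning the whole page text for anything email-shaped, so the scraper still finds addresses when Google changes the snippet layout:

    from lxml import html
    import re

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def get_email_fallback(response):
        # Hypothetical helper: greps the page's visible text for an email-shaped string.
        tree = html.fromstring(response.content)
        match = EMAIL_RE.search(tree.text_content())
        return match.group(0) if match else None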

    3. Final touches

    Here's the full code below. I've added a function which saves the email addresses in a .csv file. Note that Requests URL-encodes the search query itself, so the company name can be passed to it as-is.

    from requests import Session
    from lxml import html
    import re
    import csv
    import os

    # Reuse one session across queries; the initial HEAD request picks up Google's cookies.
    session = Session()
    session.head('https://www.google.com/')

    def google_search(input_string):
        # Requests URL-encodes the query parameter automatically.
        response = session.get(
            url='https://www.google.com/search',
            params={"q": input_string}
        )
        return response

    def get_email(response):
        tree = html.fromstring(response.content)
        # All organic result snippets share this (obfuscated, brittle) class name.
        search_results = tree.xpath("//div[@class='BNeawe s3v9rd AP7Wnd']")
        for search_result in search_results:
            headings = search_result.xpath("./text()")
            for idx, heading in enumerate(headings):
                if heading == "\nEmail: ":
                    r = re.compile(".*@.*")
                    # XPath positions are 1-based, hence the offset.
                    text = search_result.xpath(f"./span[{idx + 1}]/text()")
                    matches = list(filter(r.match, text))
                    if matches:
                        return matches[0]
        return None

    def save_email(company_name, email):
        # newline='' avoids blank rows on Windows; the header is written only once.
        with open("output.csv", 'a+', newline='') as f:
            writer = csv.writer(f)
            if os.stat("output.csv").st_size == 0:
                writer.writerow(["Company name", "Email"])
            writer.writerow([company_name, email])

    company_name = "manzoor exports"

    response = google_search(company_name)
    if response.status_code == 200:
        email = get_email(response)  # None becomes an empty cell in the CSV
        save_email(company_name, email)
    

    There are two more things left to do:

    • You have to set up a function that loads your Excel dataset. My suggestion is to save your Excel file in CSV format and load that via the csv module.
    • Google will most certainly prevent you from sending many queries at once, so it's best to throttle your requests using the time module (see the sketch below).
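
    Here's a minimal sketch of those last two steps, assuming your dataset is saved as companies.csv with a header row and the company name in the first column (the filename and column layout are assumptions):

    import csv
    import time

    with open("companies.csv", newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            company_name = row[0]
            response = google_search(company_name)
            if response.status_code == 200:
                save_email(company_name, get_email(response))
            time.sleep(5)  # pause between queries so Google doesn't block you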