
How do I find the IP address of a Google search page using Python?


New to Python programming and trying to solve a coding project. I am trying to write a piece of code that will access a subpage within a website. I'm able to access the main page of the site by passing its IP address to .connect, and then using .sendall and .recv to get the main page's basic info. Now I want to move on and capture a search page.

In this specific example: If you type keywords into the address bar (using Chrome at this moment), you get a page of search results. I'm trying to capture the raw data of that page and dump it into a file. I can get the main page's IP address for Google using .gethostbyname, but the URL for the search page is a string of words. I haven't a clue how to write code that will let me access that page, or send the search words to trigger the same response from Google, so that I can capture that data as the answer to .sendall.

Is there a way for me to access this page, which was obviously created and sent back to my web browser, using Python? If I can't do it with simple .connect and .recv code, is there another/better way?

All recommendations appreciated. Never posted code, so excuse any etiquette errors:

import socket
import sys

try:
  mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
except socket.error:
  print("Failed to create socket.")
  sys.exit()
try:
  host = (socket.gethostbyname("www.google.com"), 80)
except socket.gaierror:
  print("Failed to get host")
  sys.exit()

print (host)

print(type(host))

mysock.connect(host)
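# HTTP/1.1 requires a Host header; "Connection: close" asks the server to close when done
message = b"GET / HTTP/1.1\r\nHost: www.google.com\r\nConnection: close\r\n\r\n"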
try:
  mysock.sendall(message)
except socket.error:
  print("Failed to send")
  sys.exit()
data = mysock.recv(5000)
mysock.close()

Solution

  • When you create a socket, your operating system allocates a "file" (in quotes on purpose, not going to go into it now) and gives you back a file descriptor that identifies it. No port is involved yet: the operating system assigns a local port to the socket when you connect (or when you bind one explicitly). That socket is where you send and receive data.

    When you run the connect method with Google's IP address and a port, the operating system opens a connection using the protocol you chose when creating the socket (SOCK_STREAM means TCP) and performs an initial handshake with the server to set up a flow. Over this flow you send your request and receive the server's response. TCP carries both as a stream of bytes, split into packets whose sizes can vary, which is why a single recv call may return only part of the response.
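To make the descriptor and port assignment concrete, here is a minimal sketch. It assumes a throwaway listener on the loopback interface as a stand-in for Google's server, so it runs without internet access; the function name describe_connection is made up for illustration.

```python
import socket


def describe_connection():
    """Connect to a local stand-in server and report what the OS assigned."""
    # A listener on 127.0.0.1 plays the role of the remote server (assumption).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        server.listen(1)
        server_addr = server.getsockname()

        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            fd = client.fileno()          # the descriptor exists right after socket()
            client.connect(server_addr)   # the TCP handshake happens here
            local_port = client.getsockname()[1]   # local port chosen by the OS
            remote_port = client.getpeername()[1]  # the port we connected to

    return {"fd": fd, "local_port": local_port, "remote_port": remote_port}


if __name__ == "__main__":
    print(describe_connection())
```

The local port typically changes from run to run, while the remote port is whatever the server listens on (80 or 443 for a web server).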

    The request is basically just a string, sent to Google's servers, that tells them what you want and, more importantly, how you want it. For Google you also need something extra: an SSL/TLS connection. If you'll notice, the correct URL to Google is https://google.com and not http://google.com (although the latter redirects), because you want to negotiate session keys that encrypt your communication and hide it from others who might see it. HTTPS uses port 443 rather than 80, and the ordinary HTTP request is sent through the encrypted connection. Once you have done your connect magic, you send the request with the send method; with a higher-level Python library, the request is normally created for you. You then receive your response, which consists of the response headers (names mapped to values, giving you some initial info on what you are getting) followed by the body, which is HTML code.
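A rough sketch of that exchange with the standard ssl module is below. The helper names build_request and fetch_https are invented for this example, and it assumes the server honors "Connection: close" by closing the connection once the response has been sent.

```python
import socket
import ssl


def build_request(host, path="/"):
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",        # required by HTTP/1.1
        "Connection: close",    # ask the server to close when it is done
        "",
        "",                     # blank line ends the headers
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_https(host, path="/", port=443):
    """Send the request over TLS and return headers plus body as bytes."""
    context = ssl.create_default_context()  # verifies the server certificate
    with socket.create_connection((host, port), timeout=10) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            tls_sock.sendall(build_request(host, path))
            chunks = []
            while True:  # keep reading until the server closes the stream
                chunk = tls_sock.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    response = fetch_https("www.google.com")
    print(response[:200])
```

The loop around recv matters: one call returns whatever bytes have arrived so far, not necessarily the whole page.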

    Let's delve into the request a bit more. When you submit a search to Google, the search is encoded in the URL that you request. As @user2357112 said, a search for new apple iphone becomes https://www.google.com/search?q=new+apple+iphone&.... Everything after the question mark is a list of GET parameters separated by ampersands (&); in each one, the part before the equals sign is the parameter name and the part after it is its value. For your purposes, you only care about the q= portion, which holds the search keywords you entered into the search bar, with spaces written as plus signs. Everything else can remain the same.
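Building the q= portion by hand gets error-prone once the keywords contain characters such as & or +. A small sketch using urllib.parse.urlencode follows; the function name search_path is hypothetical, and only the q parameter from the example URL above is included.

```python
from urllib.parse import urlencode


def search_path(keywords):
    """Return the path and query string for a search on the given keywords."""
    # urlencode turns spaces into '+' and percent-encodes reserved characters.
    return "/search?" + urlencode({"q": keywords})


if __name__ == "__main__":
    print(search_path("new apple iphone"))  # /search?q=new+apple+iphone
```

The resulting string is what goes after GET in the request line, with www.google.com as the Host header.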

    Once you have sent a request to that URL and gotten your HTML response, you have to parse it to get the search results. Please ask a separate question for that if you need to, since each post should only have one question to answer.
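As for whether there is another way than raw .connect and .recv: the standard library's urllib.request handles DNS lookup, TLS, and request creation for you. The sketch below is one possible approach, not a definitive one. The function names, the User-Agent value, and the output filename are assumptions made for illustration, and Google may answer scripted requests with a consent or block page instead of results.

```python
import urllib.request
from urllib.parse import urlencode


def build_search_request(keywords):
    """Create a GET request object for a search on the given keywords."""
    url = "https://www.google.com/search?" + urlencode({"q": keywords})
    # The User-Agent value is an arbitrary placeholder (assumption).
    return urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})


def save_search_page(keywords, filename):
    """Fetch the results page and dump the raw bytes into a file."""
    request = build_search_request(keywords)
    with urllib.request.urlopen(request, timeout=10) as response:
        body = response.read()  # the body only; headers are in response.headers
    with open(filename, "wb") as out:
        out.write(body)
    return len(body)


if __name__ == "__main__":
    size = save_search_page("new apple iphone", "results.html")
    print(f"Saved {size} bytes")
```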