Tags: javascript, node.js, json, cheerio, unirest

Scraping Google Maps Results using the hidden API


I am scraping google maps results with node js using this URL:

https://www.google.com/search?q=pizza&hl=en&tbm=map&tch=1&pb=!4m8!1m3!1d11281.305980319747!2d-74.0083012!3d40.7455096!3m2!1i1024!2i768!4f13.1!7i20!8i40!10b1!12m25!1m1!18b1!2m3!5m1!6e2!20e3!6m16!4b1!23b1!26i1!27i1!41i2!45b1!49b1!63m0!67b1!73m0!74i150000!75b1!89b1!105b1!109b1!110m0!10b1!16b1!19m4!2m3!1i360!2i120!4i8!20m65!2m2!1i203!2i100!3m2!2i4!5b1!6m6!1m2!1i86!2i86!1m2!1i408!2i240!7m50!1m3!1e1!2b0!3e3!1m3!1e2!2b1!3e2!1m3!1e2!2b0!3e3!1m3!1e3!2b0!3e3!1m3!1e8!2b0!3e3!1m3!1e3!2b1!3e2!1m3!1e10!2b0!3e3!1m3!1e10!2b1!3e2!1m3!1e9!2b1!3e2!1m3!1e10!2b0!3e3!1m3!1e10!2b1!3e2!1m3!1e10!2b0!3e4!2b1!4b1!9b0!22m3!1s!2z!7e81!24m55!1m15!13m7!2b1!3b1!4b1!6i1!8b1!9b1!20b0!18m6!3b1!4b1!5b1!6b1!13b0!14b0!2b1!5m5!2b1!3b1!5b1!6b1!7b1!10m1!8e3!14m1!3b1!17b1!20m4!1e3!1e6!1e14!1e15!24b1!25b1!26b1!29b1!30m1!2b1!36b1!43b1!52b1!54m1!1b1!55b1!56m2!1b1!3b1!65m5!3m4!1m3!1m2!1i224!2i298!89b1!26m4!2m3!1i80!2i92!4i8!30m28!1m6!1m2!1i0!2i0!2m2!1i458!2i768!1m6!1m2!1i974!2i0!2m2!1i1024!2i768!1m6!1m2!1i0!2i0!2m2!1i1024!2i20!1m6!1m2!1i0!2i748!2m2!1i1024!2i768!34m16!2b1!3b1!4b1!6b1!8m4!1b1!3b1!4b1!6b1!9b1!12b1!14b1!20b1!23b1!25b1!26b1!37m1!1e81!42b1!46m1!1e9!47m0!49m1!3b1!50m53!1m49!2m7!1u3!4s!5e1!9s!10m2!3m1!1e1!2m7!1u2!4s!5e1!9s!10m2!2m1!1e1!2m7!1u16!4s!5e1!9s!10m2!16m1!1e1!2m7!1u16!4s!5e1!9s!10m2!16m1!1e2!3m11!1u16!2m4!1m2!16m1!1e1!2s!2m4!1m2!16m1!1e2!2s!3m1!1u2!3m1!1u3!4BIAE!2e2!3m1!3b1!59B!65m0!69i540

When you open this URL in your browser, it downloads a text file, but I can't understand how to parse the data that this text file contains.

I also can't tell what format the text file uses. Is it JSON or something else?

Here is my code:

const cheerio = require("cheerio");
const fs = require("fs");
const unirest = require("unirest");

const getData = async () => {
  try {
    const url =
      "https://www.google.com/search?q=pizza&hl=en&tbm=map&tch=1&pb=!4m8!1m3!1d11281.305980319747!2d-74.0083012!3d40.7455096!3m2!1i1024!2i768!4f13.1!7i20!8i40!10b1!12m25!1m1!18b1!2m3!5m1!6e2!20e3!6m16!4b1!23b1!26i1!27i1!41i2!45b1!49b1!63m0!67b1!73m0!74i150000!75b1!89b1!105b1!109b1!110m0!10b1!16b1!19m4!2m3!1i360!2i120!4i8!20m65!2m2!1i203!2i100!3m2!2i4!5b1!6m6!1m2!1i86!2i86!1m2!1i408!2i240!7m50!1m3!1e1!2b0!3e3!1m3!1e2!2b1!3e2!1m3!1e2!2b0!3e3!1m3!1e3!2b0!3e3!1m3!1e8!2b0!3e3!1m3!1e3!2b1!3e2!1m3!1e10!2b0!3e3!1m3!1e10!2b1!3e2!1m3!1e9!2b1!3e2!1m3!1e10!2b0!3e3!1m3!1e10!2b1!3e2!1m3!1e10!2b0!3e4!2b1!4b1!9b0!22m3!1s!2z!7e81!24m55!1m15!13m7!2b1!3b1!4b1!6i1!8b1!9b1!20b0!18m6!3b1!4b1!5b1!6b1!13b0!14b0!2b1!5m5!2b1!3b1!5b1!6b1!7b1!10m1!8e3!14m1!3b1!17b1!20m4!1e3!1e6!1e14!1e15!24b1!25b1!26b1!29b1!30m1!2b1!36b1!43b1!52b1!54m1!1b1!55b1!56m2!1b1!3b1!65m5!3m4!1m3!1m2!1i224!2i298!89b1!26m4!2m3!1i80!2i92!4i8!30m28!1m6!1m2!1i0!2i0!2m2!1i458!2i768!1m6!1m2!1i974!2i0!2m2!1i1024!2i768!1m6!1m2!1i0!2i0!2m2!1i1024!2i20!1m6!1m2!1i0!2i748!2m2!1i1024!2i768!34m16!2b1!3b1!4b1!6b1!8m4!1b1!3b1!4b1!6b1!9b1!12b1!14b1!20b1!23b1!25b1!26b1!37m1!1e81!42b1!46m1!1e9!47m0!49m1!3b1!50m53!1m49!2m7!1u3!4s!5e1!9s!10m2!3m1!1e1!2m7!1u2!4s!5e1!9s!10m2!2m1!1e1!2m7!1u16!4s!5e1!9s!10m2!16m1!1e1!2m7!1u16!4s!5e1!9s!10m2!16m1!1e2!3m11!1u16!2m4!1m2!16m1!1e1!2s!2m4!1m2!16m1!1e2!2s!3m1!1u2!3m1!1u3!4BIAE!2e2!3m1!3b1!59B!65m0!69i540";

    const response = await unirest.get(url).headers({
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    });

    const $ = cheerio.load(response.body);
    fs.writeFileSync("./maps.txt", response.body);
  }
  catch (e) {
    console.log(e);
  }
};

getData();

Solution

  • The existing answer is in the ballpark, but string replacements that can't differentiate between the real data and the junk to remove seem brittle and unnecessary.

    Because there's not much junk in the response (10 bytes by my count, at least on this particular response; I'll assume similar responses adhere to the same structure), there's a more precise approach.

    The first step is to remove the trailing /*""*/ from the string: data = data.slice(0, -6). The resulting structure is now valid JSON and can be parsed with JSON.parse.
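    Both stripping steps are small enough to demonstrate on a synthetic response of the same shape (the real "d" payload is far longer; the structure here is assumed to match the response described below):

```javascript
// Synthetic stand-in for the raw response body (the real "d" is ~300 KB)
const data = `{"c":0,"d":")]}'[1,2,3]","e":"abc","p":true}/*""*/`;

// Drop the 6-character /*""*/ trailer; the remainder is valid JSON
const fullObj = JSON.parse(data.slice(0, -6));
console.log(fullObj.c); // → 0
console.log(fullObj.d); // → )]}'[1,2,3]
```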

    The parsed data structure has the following top-level keys, with my summarized annotations for "d", "e" and "u":

    {
      "c": 0,
      "d": "the data payload as a string of length 300256, prefixed by )]}'",
      "e": "6m8XY-2AB_vYkPIPt9mU-A4 (or some other similar id)",
      "p": true,
      "u": "https://www.google.com/search?q=pizza&hl=en... (the full URL echoed)",
    }
    

    The payload we're mainly interested in is in key "d". It's a giant string that is mostly valid JSON, except for a 4-byte junk prefix, )]}', that we can strip out.

    After that, we can JSON.parse the rest of d to produce a giant sparse nested array with a ton of nulls in it:

    const fs = require("fs");
    
    const url =
      "https://www.google.com/search?q=pizza&hl=en&tbm=map&tch=1&pb=!4m8!1m3!1d11281.305980319747!2d-74.0083012!3d40.7455096!3m2!1i1024!2i768!4f13.1!7i20!8i40!10b1!12m25!1m1!18b1!2m3!5m1!6e2!20e3!6m16!4b1!23b1!26i1!27i1!41i2!45b1!49b1!63m0!67b1!73m0!74i150000!75b1!89b1!105b1!109b1!110m0!10b1!16b1!19m4!2m3!1i360!2i120!4i8!20m65!2m2!1i203!2i100!3m2!2i4!5b1!6m6!1m2!1i86!2i86!1m2!1i408!2i240!7m50!1m3!1e1!2b0!3e3!1m3!1e2!2b1!3e2!1m3!1e2!2b0!3e3!1m3!1e3!2b0!3e3!1m3!1e8!2b0!3e3!1m3!1e3!2b1!3e2!1m3!1e10!2b0!3e3!1m3!1e10!2b1!3e2!1m3!1e9!2b1!3e2!1m3!1e10!2b0!3e3!1m3!1e10!2b1!3e2!1m3!1e10!2b0!3e4!2b1!4b1!9b0!22m3!1s!2z!7e81!24m55!1m15!13m7!2b1!3b1!4b1!6i1!8b1!9b1!20b0!18m6!3b1!4b1!5b1!6b1!13b0!14b0!2b1!5m5!2b1!3b1!5b1!6b1!7b1!10m1!8e3!14m1!3b1!17b1!20m4!1e3!1e6!1e14!1e15!24b1!25b1!26b1!29b1!30m1!2b1!36b1!43b1!52b1!54m1!1b1!55b1!56m2!1b1!3b1!65m5!3m4!1m3!1m2!1i224!2i298!89b1!26m4!2m3!1i80!2i92!4i8!30m28!1m6!1m2!1i0!2i0!2m2!1i458!2i768!1m6!1m2!1i974!2i0!2m2!1i1024!2i768!1m6!1m2!1i0!2i0!2m2!1i1024!2i20!1m6!1m2!1i0!2i748!2m2!1i1024!2i768!34m16!2b1!3b1!4b1!6b1!8m4!1b1!3b1!4b1!6b1!9b1!12b1!14b1!20b1!23b1!25b1!26b1!37m1!1e81!42b1!46m1!1e9!47m0!49m1!3b1!50m53!1m49!2m7!1u3!4s!5e1!9s!10m2!3m1!1e1!2m7!1u2!4s!5e1!9s!10m2!2m1!1e1!2m7!1u16!4s!5e1!9s!10m2!16m1!1e1!2m7!1u16!4s!5e1!9s!10m2!16m1!1e2!3m11!1u16!2m4!1m2!16m1!1e1!2s!2m4!1m2!16m1!1e2!2s!3m1!1u2!3m1!1u3!4BIAE!2e2!3m1!3b1!59B!65m0!69i540";
    
    fetch(url, { // Node 18 or install node-fetch
      headers: {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
      }
    })
      .then(res => res.text())
      .then(data => {
        const fullObj = JSON.parse(data.slice(0, -6));
        const payload = JSON.parse(fullObj.d.slice(4));
    
        console.log(payload); // truncated to depth 2 by default
    
        // if you want to log the full "d" payload
        //    (from https://stackoverflow.com/a/41882441/6243352)
        // require("util").inspect.defaultOptions.depth = null;
        // console.log(payload);
    
        return fs.promises.writeFile(
          "google-data.json",
    
          // skip null, 2 if you want to compress
          JSON.stringify(payload, null, 2)
        );
    })
      .catch(console.error);
    

    The written file will contain the prettified "d" payload. If you want the whole response with metadata and the prefixes stripped, try fullObj.d = payload, then write fullObj to file as above.

    Here's a dump of the parsed "d", truncated by Node's default require("util").inspect.defaultOptions.depth, which is 2 on my v18.4.0:

    [
      [
        'pizza',
        [
          [Array], [Array], [Array],
          [Array], [Array], [Array],
          [Array], [Array], [Array],
          [Array], [Array], [Array],
          [Array], [Array], [Array],
          [Array], [Array], [Array],
          [Array], [Array], [Array]
        ],
        null,
        1,
        null,
        [ '1662482659245', [Array] ],
        null,
        null,
        null,
        null,
        null,
        0
      ],
      [
        [ 11281.30607800397, -74.00830120147852, 40.74550902413187 ],
        null,
        [ 1024, 768 ],
        13.1
      ],
      [],
      null,
      null,
      null,
      null,
      '4ngXY9u_IfXYkPIP45yLoAk',
      null,
      [
        [
          [Array], [Array],
          [Array], [Array],
          [Array], [Array],
          [Array], [Array]
        ]
      ],
      [ [ [Array], [Array], [Array] ] ],
      null,
      null,
      null,
      null,
      null,
      [ [ [Array] ], [ [Array] ] ],
      null,
      null,
      null,
      null,
      [
        null,
        null,
        [ '0ahUKEwjbuLvCzoD6AhV1LEQIHWPOApQQnVUIqwYoAA' ],
        [ [Array], '0ahUKEwjbuLvCzoD6AhV1LEQIHWPOApQQzzgIrAYoAQ' ],
        null,
        null,
        null,
        null,
        null,
        null,
        null,
        null,
        null,
        null,
        null,
        'IAE='
      ],
      null,
      null,
      null,
      null,
      null,
      [ 9 ],
      null,
      [ [ null, [Array], [Array], 'IAE=' ] ],
      null,
      null,
      null,
      null,
      null,
      'Q2dBd0Fn',
      null,
      [ 2 ],
      0,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      [ 1 ]
    ]
    
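    A small helper can make the sparse structure easier to explore by listing the path of every non-null leaf. This is only a sketch: the sample payload below is a miniature stand-in for payload[0] from the dump above, and all index meanings are guesses from this single response:

```javascript
// List [path, value] pairs for every non-null leaf in a nested array
const leaves = (node, path = "") => {
  if (node === null) return [];
  if (Array.isArray(node)) {
    return node.flatMap((child, i) => leaves(child, `${path}[${i}]`));
  }
  return [[path, node]];
};

// Miniature stand-in for payload[0] from the dump (structure assumed);
// the dump suggests the per-result entries live under payload[0][1]
const payload = ["pizza", [[null, "Place A"], [null, "Place B"]], null, 1];
console.log(leaves(payload));
// → [ [ '[0]', 'pizza' ], [ '[1][0][1]', 'Place A' ],
//     [ '[1][1][1]', 'Place B' ], [ '[3]', 1 ] ]
```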

    I have no idea how to actually use this data or how the URL was created, so a comment with an API/consumption reference for future visitors would be nice.