I am new to SEO and just want to get the idea about how it works for Single Page Application with dynamic content.
In my case, I have a single page application (powered by AngularJS, using router to show different state) that provides some location-based search functionalities, similar to Zillow, Redfin, or Yelp. On mt site, user can type in a location name, and the site will return some results based on the location.
I am trying to figure out a way to make it work well with Google. For example, if I type in "Apartment San Francisco" in Google, the results will be:
And when user click on these links, the sites will display the correct result. I am thinking about having similar SEO like these for my site.
The question is, the page content is purely depending on user's query. User can search by city name, state name, zip code, etc, to show different results, and it's not possible to put them all into sitemap. How google can crawl the content for these kind of dynamic page results?
I don't have experience with SEO and not sure how to do it for my site. Please share some experience or pointers to help me get started. Thanks a lot!
===========
Follow up question:
I saw Googlebot can now run Javascript. I want to understand a bit more of this.
When a specific url of my SPA app is opened, it will do some network query (XHR request) for a few seconds and then the page content will be displayed. In this case, will GoogleBot wait for the http response?
I saw some tutorial says we need to prepare static html specifically for Search Engines. If I only want to deal with Google, does it mean I don't have to serve static html anymore because Google can run Javascript?
Thanks again.
If a search engine should come across your JavaScript application then we have the permission to redirect the search engine to another URL that serves the fully rendered version of the page.
Or
Implementation using Phantom.js
We can setup a node.js server that given a URL, it will fully render the page content. Then we will redirect bots to this server to retrieve the correct content.
We will need to install node.js and phantom.js onto a box. Then start up this server below. There are two files, one which is the web server and the other is a phantomjs script that renders the page.
// web.js
// Express is our web server that can handle request
var express = require('express');
var app = express();
var getContent = function(url, callback) {
var content = '';
// Here we spawn a phantom.js process, the first element of the
// array is our phantomjs script and the second element is our url
var phantom = require('child_process').spawn('phantomjs',['phantom-server.js', url]);
phantom.stdout.setEncoding('utf8');
// Our phantom.js script is simply logging the output and
// we access it here through stdout
phantom.stdout.on('data', function(data) {
content += data.toString();
});
phantom.on('exit', function(code) {
if (code !== 0) {
console.log('We have an error');
} else {
// once our phantom.js script exits, let's call out call back
// which outputs the contents to the page
callback(content);
}
});
};
var respond = function (req, res) {
// Because we use [P] in htaccess we have access to this header
url = 'http://' + req.headers['x-forwarded-host'] + req.params[0];
getContent(url, function (content) {
res.send(content);
});
}
app.get(/(.*)/, respond);
app.listen(3000);
The script below is phantom-server.js and will be in charge of fully rendering the content. We don't return the content until the page is fully rendered. We hook into the resources listener to do this.
var page = require('webpage').create();
var system = require('system');
var lastReceived = new Date().getTime();
var requestCount = 0;
var responseCount = 0;
var requestIds = [];
var startTime = new Date().getTime();
page.onResourceReceived = function (response) {
if(requestIds.indexOf(response.id) !== -1) {
lastReceived = new Date().getTime();
responseCount++;
requestIds[requestIds.indexOf(response.id)] = null;
}
};
page.onResourceRequested = function (request) {
if(requestIds.indexOf(request.id) === -1) {
requestIds.push(request.id);
requestCount++;
}
};
// Open the page
page.open(system.args[1], function () {});
var checkComplete = function () {
// We don't allow it to take longer than 5 seconds but
// don't return until all requests are finished
if((new Date().getTime() - lastReceived > 300 && requestCount === responseCount) || new Date().getTime() - startTime > 5000) {
clearInterval(checkCompleteInterval);
console.log(page.content);
phantom.exit();
}
}
// Let us check to see if the page is finished rendering
var checkCompleteInterval = setInterval(checkComplete, 1);
Once we have this server up and running we just redirect bots to the server in our client's web server configuration.
Redirecting bots If you are using apache we can edit out .htaccess such that Google requests are proxied to our middle man phantom.js server.
RewriteEngine on
RewriteCond %{QUERY_STRING} ^_escaped_fragment_=(.*)$
RewriteRule (.*) http://webserver:3000/%1? [P]
We could also include other RewriteCond
, such as user agent to redirect other search engines we wish to be indexed on.
Though Google won't use _escaped_fragment_
unless we tell it to by either including a meta tag; <meta name="fragment" content="!">
or using #!
URLs in our links.
You will most likely have to use both.
This has been tested with Google Webmasters fetch tool. Make sure you include #!
on your URLs when using the fetch tool.