April 22, 2014

Server-Side Rendering of Single Page Apps using PhantomJS and Node.js

tl;dr

We present a simple approach to server-side rendering of JavaScript-heavy pages using PhantomJS and Node.js. When we receive a request from a bot, we use Nginx to route it to a special Node server. The Node server then spawns a Phantom process to render the page. Once Phantom has rendered the page, Node responds with the fully rendered page.

I recently ran a test to see how well the Googlebot can index a single page app (SPA). It turns out that the Googlebot isn’t all that great at it. It did a decent job discovering the link structure, but failed to properly read page content. The content of most of the pages that it indexed appeared blank. Having Google index blank pages is really bad, so the rest of this guide is devoted to fixing the underlying cause.

The problem with SPAs is that they often start with a blank page and fill in their content using JavaScript. Since the Googlebot doesn’t have great support for JavaScript, it will often think the initial blank page is what it should index. The Googlebot doesn’t know that it is missing all the content that a normal visitor would see.

Conceptually the fix is easy – don’t rely on the Googlebot to render our pages properly. When the Googlebot requests one of our pages, instead of serving up our fancy SPA, we’ll render the equivalent HTML server-side. We’ll then return this fully rendered webpage to the Googlebot and our site will get indexed properly.

Before we go on, it’s worth noting that Google is totally cool with this procedure. Just make sure to render the same content as a normal visitor would see. Don’t try to be clever and add a bunch of keywords that aren’t on your regular site because Google doesn’t take too kindly to cloaking.

You might think that rendering our SPA server-side would be tricky, but it is actually quite straightforward using Phantom. Phantom is a headless WebKit browser, so it can render webpages like Chrome or Safari but has an interface that is accessible from the command line instead of through a GUI. The idea, then, is to tell Phantom to render our SPA just like a normal browser would, and then pass the fully rendered page back to the Googlebot.

If you don’t have Phantom installed, now would be a good time to install it. Once Phantom is installed, we can run it from the command line like so:

phantomjs phantom-server.js http://example.com

Where phantom-server is a script we’ll create and http://example.com is the URL of the webpage we want to render.

phantom-server.js

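var system = require('system');
var page = require('webpage').create();

// system.args[0] is the script name; system.args[1] is the URL to render.
page.open(system.args[1], function (status) {
    if (status !== 'success') {
        // Exit non-zero so the caller can tell the load failed.
        phantom.exit(1);
        return;
    }
    // Write the rendered DOM to standard output.
    console.log(page.content);
    phantom.exit();
});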

We use system to read in the URL and pass it as the first argument to the web page module’s open method. The second argument to open is a callback that fires when Phantom has finished loading the page. Since the callback runs once the page has loaded, page.content is the rendered page we’re after (content that your app fetches asynchronously after load may not be there yet). We then send this rendered web page to Phantom’s standard output stream using console.log.

That’s the heart of the beast. All that is left to do is to hook up the wiring to make it work from request to response. Working from the top down, we’re going to have Nginx match the user agent of the Googlebot (as well as some other popular bots) and send that traffic to a Node server. The Node server will then pass the request to Phantom and wait for Phantom to render the page. Once Phantom is done rendering the page, Node will respond with the fully rendered page.

First up, Nginx. Modify your Nginx configuration file to include the following:

your-app.conf

upstream node_phantomjs_server {
  server 127.0.0.1:8888;
}

server {
.
.
.
  error_page 419 = @bots;
  if ($http_user_agent ~* (googlebot|yahoo|baiduspider|bingbot|yandexbot|teoma)) {
    return 419;
  }

  location @bots {
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;
    proxy_redirect off;
    proxy_pass http://node_phantomjs_server;
  }
.
.
.
}

We’re doing a bit of a hack here to get Nginx to send our traffic to Node instead of where it normally goes (Rails in my case). If Nginx matches the user agent of one of the bots it sends the traffic to error page 419. We’re using error code 419 because it is unused by the specification so it lends itself to this hack. We’ve set up error page 419 to point to @bots which sets some headers and sends the traffic along to our upstream Node server on localhost port 8888.
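To see which user agents that pattern catches before deploying it, the same case-insensitive match can be tried out in Node. This is only a sketch: isBot is a name made up for illustration, and the two sample user-agent strings are examples rather than an exhaustive list.

```javascript
// Same alternation as the Nginx `if` block; the `i` flag plays the role of `~*`.
var BOT_PATTERN = /(googlebot|yahoo|baiduspider|bingbot|yandexbot|teoma)/i;

// Hypothetical helper: true when the user agent would be routed to @bots.
function isBot(userAgent) {
    return BOT_PATTERN.test(userAgent || '');
}

// A Googlebot-style user agent matches.
console.log(isBot('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'));
// A regular desktop browser does not.
console.log(isBot('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 Chrome/34.0.1847.116 Safari/537.36'));
```

Keep in mind that this is a substring match, so any user agent that happens to contain one of these words gets routed to Node as well.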

Next, let’s set up our Node server to respond to traffic on port 8888.

app.js

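var express = require('express');
var spawn = require('child_process').spawn;
var app = express();

app.use(function (req, res) {
    var content = '';
    // Rebuild the URL the bot originally requested.
    var url = req.protocol + '://' + req.get('host') + req.originalUrl;
    var phantom = spawn('phantomjs', ['phantom-server.js', url]);
    phantom.stdout.setEncoding('utf8');
    // Collect the rendered HTML as Phantom writes it to stdout.
    phantom.stdout.on('data', function (data) {
        content += data;
    });
    // Fires if the phantomjs process could not be started at all.
    phantom.on('error', function (err) {
        console.log('error', err);
        if (!res.headersSent) { res.status(500).end(); }
    });
    // 'close' fires after the process has ended and stdout has been fully read.
    phantom.on('close', function (status_code) {
        if (res.headersSent) { return; }
        if (status_code !== 0) {
            // Answer with an error instead of leaving the request hanging.
            console.log('error');
            res.status(500).end();
        } else {
            res.send(content);
        }
    });
});

app.listen(8888);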

Note: some of the code above is borrowed from Thomas Davis, and I encourage you to check out his tutorial on the same subject.

Here, we’re using Express, although we could have also written a simple Node server to accomplish the same thing. Our implementation is quite simple since we handle all requests with the middleware. First, our middleware reconstructs the requested URL, then spawns a Phantom process to render it. We catch all data that Phantom spits out and store it as content. After Phantom is finished, we take the fully rendered page and ship it out.
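The URL reconstruction is the step most worth checking by hand, because it depends on the Host header that Nginx forwards. Here is a small sketch that pulls it into a function and feeds it a stand-in request object. requestUrl and fakeReq are hypothetical names; protocol, get and originalUrl are the only Express request members it relies on, the same ones the middleware uses.

```javascript
// Hypothetical helper: rebuild the absolute URL from an Express-style request.
function requestUrl(req) {
    return req.protocol + '://' + req.get('host') + req.originalUrl;
}

// Stand-in for an Express request, for illustration only.
var fakeReq = {
    protocol: 'http',
    originalUrl: '/projects?page=2',
    headers: { host: 'example.com' },
    get: function (name) {
        return this.headers[name.toLowerCase()];
    }
};

console.log(requestUrl(fakeReq)); // http://example.com/projects?page=2
```

If your site is served over HTTPS, note that req.protocol describes the connection between Nginx and Node, so the scheme may need adjusting for your setup.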

That’s all there is to it. We’ve now made the Googlebot happy and that makes us happy. However, there is one gotcha that I came across that I thought I would share with you. I tend to store client-side state in the params of my URLs so that given a URL I can parse the params and reconstruct the intended state of my JavaScript app. This makes it easy to share a URL in an email, for example. However, when I make hrefs inside of anchor tags, I don’t always encode ampersands like I should. This is usually not a problem and everything works fine. However, when rendering pages with Phantom, Phantom is kind enough to properly encode ampersands. In my case, this caused a problem because I had written some client-side code that wasn’t expecting encoded ampersands. This behavior may not present a problem for you but it’s something to watch out for.
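One way to make client-side parsing tolerant of this is to normalize encoded ampersands before splitting the query string. Here is a minimal sketch, assuming the state lives in simple key=value pairs. parseParams is a hypothetical name, not part of any library, and your own parsing code may well look different.

```javascript
// Hypothetical helper: parse a query string into an object, accepting both
// raw ampersands (&) and HTML-encoded ones (&amp;) as separators.
function parseParams(query) {
    var params = {};
    query.replace(/^\?/, '').replace(/&amp;/g, '&').split('&').forEach(function (pair) {
        if (!pair) { return; }
        var idx = pair.indexOf('=');
        var key = idx === -1 ? pair : pair.slice(0, idx);
        var value = idx === -1 ? '' : pair.slice(idx + 1);
        params[decodeURIComponent(key)] = decodeURIComponent(value);
    });
    return params;
}

// Both forms produce the same state object.
console.log(parseParams('?tab=issues&amp;page=2'));
console.log(parseParams('?tab=issues&page=2'));
```

The replacement happens before decoding, so a value that legitimately contains a percent-encoded ampersand is left alone.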

I hope you’ve found this guide helpful, and I hope you now have a deeper understanding of what it takes to render your JavaScript-heavy pages server-side.
