Thursday, March 19, 2009

Exploring Google App Engine

OpenBSD I registered my domain in 1996, when Network Solutions was the only registrar and their domain registration forms were faxed in. Back then email and web service providers were rather expensive for a small private domain, so I set up the appropriate services on an OpenBSD system at my house. We had a cable modem from @Home, with a static IP address because DHCP was new and unproven and PPPoE did not exist. We bought DNS service from Illuminati Online, and pointed it at the @Home static IP address.

In 2009 running servers on a system in ones home is far less the entertaining pursuit than it was in 1996. Botnets launch constant automated sweeps looking for vulnerable machines, and the deluge of spam is ever-increasing. So I started looking for alternatives, and fortunately the intervening years have radically changed the ISP market. Lots of hosting providers are available for a small fee. However the trickle of traffic to our current web site comes entirely from family, friends, and botnets, and I'd prefer to avoid paying for such a small installation. Google Sites is free of charge, but does not allow subdirectories. We have a modest collection of pages built up over a decade, I'm not interested in redoing all of the links to flatten the hierarchy.

Google App Engine There is another free option now: Google App Engine. Intended to run Python applications, it does handle static files and allows subdirectories in its file hierarchy. I started with the most straightforward approach for serving entirely static files, relying on Charles Engelke's writeup. You register for Google App Engine, put your HTML files in a subdirectory (which I've named named "static"), and create an app.yaml like so:


application: my-static-webpages
version: 1
runtime: python
api_version: 1

- url: (.*)/
  static_files: static\1/index.html
  upload: static/index.html

- url: /.*
  static_dir: static

 

Python This works, albeit with drawbacks, the most serious being the handling of inexact links. There are external links to our pages which lack the trailing slash, pointing to "http://www.example.com/foo" where foo is a directory. The existing Apache server would kindly send a redirect to "http://www.example.com/foo/" but the App Engine static handler returns a 404 error. I want to be forgiving, and not break existing links. Charles Engelke wrote a followup article describing how to perform basic HTTP redirects, so I decided to tackle something similar. Rather than use the static handler, we'll use Python. We configure app.yaml to run the Python handler:


application: my-static-webpages
version: 1
runtime: python
api_version: 1

handlers:
- url: /.*
  script: forgivedirectories.py

Though I bought my first Python book many years ago, I've done relatively little work in the language. This was quite a learning experience.


forgivedirectories.py:

from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
import os

class ForgiveDirectories(webapp.RequestHandler):
  def get(self):    
    fullPath = 'static/' + self.request.path
    if not os.path.exists(fullPath):
      self.redirect('/404.html')
      return
If it really is a bad link, we send them to an error page.
    if os.path.isdir(fullPath):
      if fullPath[-1] != '/':
        self.redirect(self.request.path + '/')
        return
      else:
        fullPath += '/index.html'
This is the "forgive" part of forgivedirectories.py. If the requested path is a directory but does not end in a slash, send them a redirect to the proper path.
Note that its important to send a redirect, and not just send the index.html page. Relative links like "../bar/index.html" will only work if the browser's path ends in a slash.
    # if Google App Engine adds a
    # sendfile equivalent, we should
    # use it here.
    fh = open(fullPath, 'r')
    while 1:
      block = fh.read(16384)
      if not block:
        break
      self.response.out.write(block)
    fh.close
If we make it here, we have a page to send. To limit the memory footprint we loop over every 16 KBytes; slurping the entire file into memory is neither necessary nor beneficial.
application = webapp.WSGIApplication(
                [('.*', ForgiveDirectories)],
                debug=True)

def main():
  run_wsgi_app(application)

if __name__ == "__main__":
  main()
Boilerplate code to initialize the web services gateway and call into our function. Its possible to match regular expressions to dispatch to multiple different classes, but here we match everything and send it to ForgiveDirectories.

 

This works, sortof. You can see the web pages now. One problem is that the Content-Type is not being set, so browsers have to assume a default. Amusingly enough most browsers default to text/html, so at least the pages render even if images are broken. Google Chrome defaults to text/plain, so it shows the HTML source of the pages. So we obviously need to set the Content-Type, which we will do by examining the file's extension. Being new to Python, it took a few tries to get something reasonable.


 
  def contentTypeFromExt(self, extension):
    if extension=='.html':
      return 'text/html'
    elif extension=='.jpg':
      return 'image/jpeg'
    elif extension=='.gif':
      return 'image/gif'
    elif extension=='.png':
      return 'image/png'
    elif extension=='.js':
      return 'application/x-javascript'
    elif extension=='.css':
      return 'text/css'
    else:
      return 'text/plain'
The first attempt. Its basically C code, translated into Python.
  def contentTypeFromExt(self, ext):
    contenttype = {
      ext == '.html': 'text/html',
      ext == '.jpg' : 'image/jpeg',
      ext == '.gif' : 'image/gif',
      ext == '.png' : 'image/png',
      ext == '.js'  : 'application/x-javascript',
      ext == '.css' : 'text/css'}[1]
    return contenttype
The second attempt. Python does not have a switch statement, but some Google searching showed how to construct something that looks like one if you squint at it just right.
Unfortunately this version doesn't supply 'text/plain' as a default, and its just... weird. Trying to torture the code to vaguely resemble a switch statement is not very pleasing.
  contentExtensions = {
      '.html'  : 'text/html',
      '.jpg'   : 'image/jpeg',
      '.gif'   : 'image/gif',
      '.png'   : 'image/png',
      '.js'    : 'application/x-javascript',
      '.css'   : 'text/css'}

  def contentTypeFromExt(self, ext):
    try:
      return self.contentExtensions[ext]
    except KeyError:
      return 'text/plain'
The third attempt, and the first real effort to do it in a Pythonic way. Python has hash tables as a fundamental type in the language, so let's use them! If the key is not found in the dictionary an exception will be thrown... so I guess we're supposed to catch it? I dunno. Obviously, not finding the key in the dictionary is expected to be an unusual occurrence. There must be a better way to do it.
  contentExtensions = {
      '.html'  : 'text/html',
      '.jpg'   : 'image/jpeg',
      '.gif'   : 'image/gif',
      '.png'   : 'image/png',
      '.js'    : 'application/x-javascript',
      '.css'   : 'text/css'}

  def contentTypeFromExt(self, ext):
    return self.contentExtensions.get(ext, \
                               'text/plain')
The current version, which I'm pretty happy with. By calling the get method directly we can supply a default value to be returned, instead of having it throw an exception.

 

Now we need to actually set the Content-Type header. I'm using os.path.splitext() to pull out the file extension, which is probably wrong: on Windows I think that method expects backslashes, where a URL will always be forward slashes. It is fine when used with the App Engine (which is not hosted on Windows), so I'll rely on it until I can figure out a more appropriate method to use.


  def contentTypeFromPath(self, path):
    (basename, extension) = os.path.splitext(path)
    return self.contentTypeFromExt(extension)
  
  # The following goes in the get() method
  # immediately before we open the fullPath file:
  self.response.headers['Content-Type'] = self.contentTypeFromPath(fullPath);

 
Left To Be Implemented...

The App Engine is a lot of fun. Its a way to experiment with web services without expense for hosting. The forgivedirectories.py script is now in service,handling our web site. There are a few things left to do, which I hope to cover in future updates:

  • After reorganizing the web site some time ago, the old Apache server was configured to send 302 redirects for the old hierarchy. The new AppEngine handler should too, just in case there are any links remaining to the old URLs.
  • For efficiency we should allow the browser to cache the pages, by implementing Last-Modified-Since support. The ipsojobs blog has a writeup about this which I intend to leverage