Coding Relic: AppEngine


Thursday, April 9, 2009

More Google App Engine - Feedflares

Today's article is a discussion about some of the infrastructure for running this site. So far as I can tell from the logs and analytics, nearly everyone reading this blog does so via RSS. The RSS feed for this site is provided by FeedBurner, now part of Google. FeedBurner supports FeedFlares, small widgets appended after the content which can supply additional information or link to other services. I currently use several FeedFlares in the RSS feed, for del.icio.us and friendfeed. The friendfeed flare is new, and is the topic of this writeup.

friendfeed is a social media aggregation service, collecting updates from services like Digg, Flickr, and various blog platforms into a single stream of updates. Many good articles about friendfeed can be found on louisgray.com.

The RSS feed for this blog is imported into friendfeed where people can see it, mark it as something they liked, or leave comments. At the time I started working on this project there was not a FeedFlare for friendfeed. There is now, but I decided to finish my version anyway and post it here. As Google App Engine is my favorite new toy, the FeedFlare is a GAE application.

We'll go straight to the code which gathers information from friendfeed to create the FeedFlare. I'm going to skip the boilerplate code for an application on the Google App Engine. It can be found in an earlier article about the App Engine, if needed. The complete source for this feedflare is also available for download.

class FriendfeedFlare(webapp.RequestHandler):
  def get(self):
    self.response.headers['Content-Type'] = "text/plain"

    scheme, host, path, param, query, frag = urlparse.urlparse(self.request.url)
    args = cgi.parse_qs(query)

    url      = self.parseArg(args, "url")
    nickname = self.parseArg(args, "nickname")
    api_key  = self.parseArg(args, "api_key")

    if url is None:
        self.response.out.write("<FeedFlare><Text>No URL specified!</Text></FeedFlare>\n")
        return

    subscribed = 1 if (nickname is not None and api_key is not None) else 0

Three arguments are accepted, using the standard CGI convention of http://host/path?arg1=value&arg2=value

url - the url of the RSS item this FeedFlare should reference. This argument is required.
nickname - the friendfeed account to authenticate as. If provided, the search will be restricted to friends of this nickname. If not provided, we search all entries on friendfeed.com.
api_key - the API Key to authenticate us. If nickname is provided the api_key must also be provided.

Note: At the time of this writing (4/9/2009) the nickname functionality to restrict results to subscribers is not working. It was working a couple of days ago, but seemed to break just as I posted this article. I'll update the post if I get it working again; for now the feedflare is usable when searching all entries on friendfeed.com.

urlparse.urlparse() is employed to break the URL into its main components, and then cgi.parse_qs() pulls out the individual parameters. parse_qs() returns each argument as a list, because it allows multiple instances of an argument. In this case only one makes sense, so we get back a list with one member. self.parseArg() is a small helper routine to return None if the argument is not present, or the first element in the list returned from cgi.parse_qs().
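The parsing step can be tried on its own. This is a minimal sketch using the Python 3 standard-library names (urllib.parse took over the roles of urlparse and cgi.parse_qs); the request URL and the parse_arg name are made up for illustration and are not part of the FeedFlare source.

```python
# Python 3 sketch: urllib.parse replaces the urlparse and cgi modules used above.
from urllib.parse import urlparse, parse_qs

def parse_arg(args, argname):
    """Return the first value for argname, or None if it is absent."""
    values = args.get(argname)
    return values[0] if values else None

# Hypothetical request URL, for illustration only.
request_url = ("http://flare.example.com/friendfeed"
               "?url=http%3A%2F%2Fexample.com%2Fpost&nickname=alice")

# parse_qs maps each argument name to a list of (percent-decoded) values.
args = parse_qs(urlparse(request_url).query)

print(args["nickname"])            # a one-element list
print(parse_arg(args, "url"))      # first element of the list
print(parse_arg(args, "api_key"))  # None, because it was not supplied
```

Using dict.get() rather than a bare except keeps unrelated errors from being silently turned into None.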

    try:
        ffsession = friendfeed.FriendFeed(nickname, api_key)
        entries   = ffsession.fetch_url_feed(url, subscribed)
    except IOError:
        self.error(503)
        return

Friendfeed supplies Python wrapper functions for their API. The wrapper functions are used here to connect to friendfeed.com, using the authorization credentials (if present). If friendfeed is not responding, a 503 response is sent to feedburner.com. This Service Unavailable result tells feedburner to continue to use its cached information and to try again later.

fetch_url_feed() is a function added to the friendfeed API, to support their /api/feed/url API. It fetches all entries which reference the given url.

    totalshares   = 0
    totalcomments = 0
    likers        = set()
    linkurl       = "http://friendfeed.com/"
    linkcomments  = -1
    
    for entry in entries["entries"]:
        totalshares   += 1
        numcomments    = len(entry["comments"])
        totalcomments += numcomments
        if (numcomments > linkcomments):
            linkurl = "http://friendfeed.com/e/" + entry["id"]
            linkcomments = numcomments
        for like in entry["likes"]:
            liker = like["user"]
            likers.add(liker["name"])

    totallikes = len(likers)

The friendfeed API returns entries in JSON format, which is parsed by their API and returned as nested Python dictionaries and lists. To count the number of likes and comments, one needs to iterate over each entry.

likers is a Python set, a datatype I learned about while working on this project. A set is a group of objects which will contain no duplicates. If you add an item to the set which is already present the set will contain only one instance of the item, not two. This is used to avoid overcounting likes: if the URL we are looking for was shared multiple times in friendfeed and the same user marked every one of them as liked, we only want to count that as one like, not many.
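A small illustration of the deduplication, assuming sample data shaped like the fields the loop above reads (the entries, ids and user names here are invented):

```python
# Hypothetical sample data with the same shape the counting loop expects.
entries = {
    "entries": [
        {"id": "e1",
         "comments": [{"body": "nice"}],
         "likes": [{"user": {"name": "Alice"}}, {"user": {"name": "Bob"}}]},
        {"id": "e2",
         "comments": [],
         "likes": [{"user": {"name": "Alice"}}]},
    ]
}

likers = set()
raw_likes = 0
for entry in entries["entries"]:
    for like in entry["likes"]:
        raw_likes += 1
        likers.add(like["user"]["name"])  # adding "Alice" twice keeps one copy

print(raw_likes)    # 3 likes were recorded in total
print(len(likers))  # 2 distinct people liked the URL
```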

The linkurl is a compromise. I'd really like to direct the link to a page containing all of the results for this URL. Unfortunately only the friendfeed JSON API includes URL search functionality, the web search page does not. So far as I can tell there is no way to link back to friendfeed for more than one entry ID. So here we link to the entry with the most comments.

    self.response.out.write("<FeedFlare>\n")
    if (totalshares == 0):
        self.response.out.write("  <Text>On Friendfeed: 0 shares</Text>\n")
    else:
        self.response.out.write("  <Text>On Friendfeed: " +                      \
                                self.fmtTotal(totalshares,   "Share")   + ", " + \
                                self.fmtTotal(totallikes,    "Like")    + ", " + \
                                self.fmtTotal(totalcomments, "Comment") +        \
                                "</Text>\n");
    self.response.out.write("  <Link href=\"" + linkurl + "\"/>\n");
    self.response.out.write("</FeedFlare>")
    return

Generate the XML output. self.fmtTotal() is another little helper routine to pluralize the output correctly, "1 Comment" versus "2 Comments". The result of all this processing is a simple bit of XML:

<FeedFlare>
  <Text>On Friendfeed: 5 Shares, 1 Like, 2 Comments</Text>
  <Link href="http://friendfeed.com/e/1b0141a1-f6fa-1be2-e775-e5d36959e04c"/>
</FeedFlare>

This is all feedburner needs to create the FeedFlare. All formatting, including the font size and the blue text coloring, is hard-coded by feedburner. The FeedFlare does not get to supply any formatting, just some text and an optional link.
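The handler above builds the XML by plain string concatenation, which is fine for friendfeed entry IDs. If the text or link could ever contain characters such as & or <, one option is to escape them with the standard library. This is a sketch only; build_flare is a hypothetical helper, not part of the FeedFlare source.

```python
from xml.sax.saxutils import escape, quoteattr

def build_flare(text, link_url):
    """Return FeedFlare XML with the text and link escaped for XML."""
    # quoteattr() escapes the value and supplies the surrounding quotes.
    return ("<FeedFlare>\n"
            "  <Text>" + escape(text) + "</Text>\n"
            "  <Link href=" + quoteattr(link_url) + "/>\n"
            "</FeedFlare>")

flare = build_flare(
    "On Friendfeed: 5 Shares, 1 Like, 2 Comments",
    "http://friendfeed.com/e/1b0141a1-f6fa-1be2-e775-e5d36959e04c")
print(flare)
```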

  def fmtTotal(self, count, descr):
    suffix = "" if (count == 1) else "s"
    return str(count) + " " + descr + suffix

  def parseArg(self, args, argname):
    try:
        ret = args[argname][0]
    except:
        ret = None
    return ret

The aforementioned helper routines.

That's it, or rather that's the interesting part. The complete source can be downloaded.

The next question is, what is missing? What does it not do, that perhaps it should?

There is no caching of the result. Every request for the FeedFlare results in another API request to friendfeed.com. I believe this is acceptable because FeedBurner limits the rate of FeedFlare requests to about one per two hours.
The link in the generated FeedFlare points to the friendfeed entry with the most comments. This is a compromise. I'd rather link to a search results page with all of the entries regarding the given URL, but cannot find a good way to do it. I'd have to make the FeedFlare dynamically construct a page populated with all of the links, showing all of the likes and comments... and that is too much work for this little project. I hope that someday, friendfeed.com will provide a way to supply multiple entry IDs to appear on a single page.
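If caching ever did become necessary, the idea is simple enough to sketch. The class below is a hypothetical, in-memory, time-limited cache written for illustration; on App Engine the memcache service would be the more natural place to keep the values, since instance memory is not shared between instances.

```python
import time

class TTLCache:
    """Tiny in-memory cache; entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the example can fake time
        self.store = {}         # key -> (time stored, value)

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        hit = self.store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]       # still fresh, skip the remote call
        value = fetch(key)
        self.store[key] = (now, value)
        return value

# Demonstration with a fake clock and a stand-in for the friendfeed request.
now = [1000.0]
calls = []

def fake_fetch(url):
    calls.append(url)
    return "flare for " + url

cache = TTLCache(ttl_seconds=7200, clock=lambda: now[0])
first = cache.get_or_fetch("http://example.com/post", fake_fetch)
second = cache.get_or_fetch("http://example.com/post", fake_fetch)  # cached
now[0] += 7201                                                      # two hours pass
third = cache.get_or_fetch("http://example.com/post", fake_fetch)   # refetched
print(len(calls))  # 2
```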

Using the FeedFlare

If you are interested in using this FeedFlare on your own blog, please feel free. You have a few options:

To use it without a specific nickname (so the results will include Everyone on friendfeed whether they follow you or not) you can use this link as the Flare Unit URL in the Feedburner -> Optimize -> FeedFlares page for your feed.
To configure it to only include people who subscribe to you on friendfeed, download friendfeeduser.xml (http://feedflare.geekhold.com/feedflareunit/friendfeeduser.xml). Replace MY_NICKNAME with your friendfeed account name, and MY_API_KEY with your Remote API Key, and put the modified file somewhere on your own site to be used to configure FeedBurner.
The functionality to restrict the results to your subscribers is not working right now. Please stay tuned. I'll post an update on friendfeed.com if I get it working again.
If you don't like something about the way this code works, you're free to modify it. You can download the source code, set up your own Google App Engine application, and modify it as desired.

Thursday, March 19, 2009

Exploring Google App Engine

I registered my domain in 1996, when Network Solutions was the only registrar and their domain registration forms were faxed in. Back then email and web service providers were rather expensive for a small private domain, so I set up the appropriate services on an OpenBSD system at my house. We had a cable modem from @Home, with a static IP address because DHCP was new and unproven and PPPoE did not exist. We bought DNS service from Illuminati Online, and pointed it at the @Home static IP address.

In 2009, running servers on a system in one's home is a far less entertaining pursuit than it was in 1996. Botnets launch constant automated sweeps looking for vulnerable machines, and the deluge of spam is ever-increasing. So I started looking for alternatives, and fortunately the intervening years have radically changed the ISP market. Lots of hosting providers are available for a small fee. However the trickle of traffic to our current web site comes entirely from family, friends, and botnets, and I'd prefer to avoid paying for such a small installation. Google Sites is free of charge, but does not allow subdirectories. We have a modest collection of pages built up over a decade, and I'm not interested in redoing all of the links to flatten the hierarchy.

There is another free option now: Google App Engine. Intended to run Python applications, it does handle static files and allows subdirectories in its file hierarchy. I started with the most straightforward approach for serving entirely static files, relying on Charles Engelke's writeup. You register for Google App Engine, put your HTML files in a subdirectory (which I've named "static"), and create an app.yaml like so:

application: my-static-webpages
version: 1
runtime: python
api_version: 1

handlers:
- url: (.*)/
  static_files: static\1/index.html
  upload: static/index.html

- url: /.*
  static_dir: static

This works, albeit with drawbacks, the most serious being the handling of inexact links. There are external links to our pages which lack the trailing slash, pointing to "http://www.example.com/foo" where foo is a directory. The existing Apache server would kindly send a redirect to "http://www.example.com/foo/" but the App Engine static handler returns a 404 error. I want to be forgiving, and not break existing links. Charles Engelke wrote a followup article describing how to perform basic HTTP redirects, so I decided to tackle something similar. Rather than use the static handler, we'll use Python. We configure app.yaml to run the Python handler:

application: my-static-webpages
version: 1
runtime: python
api_version: 1

handlers:
- url: /.*
  script: forgivedirectories.py

Though I bought my first Python book many years ago, I've done relatively little work in the language. This was quite a learning experience.

forgivedirectories.py:

from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
import os

class ForgiveDirectories(webapp.RequestHandler):
  def get(self):
    fullPath = 'static/' + self.request.path
    if not os.path.exists(fullPath):
      self.redirect('/404.html')
      return

If it really is a bad link, we send them to an error page.

    if os.path.isdir(fullPath):
      if fullPath[-1] != '/':
        self.redirect(self.request.path + '/')
        return
      else:
        fullPath += '/index.html'

This is the "forgive" part of forgivedirectories.py. If the requested path is a directory but does not end in a slash, send them a redirect to the proper path. Note that it's important to send a redirect, and not just send the index.html page. Relative links like "../bar/index.html" will only work if the browser's path ends in a slash.

    # if Google App Engine adds a
    # sendfile equivalent, we should
    # use it here.
    fh = open(fullPath, 'r')
    while 1:
      block = fh.read(16384)
      if not block:
        break
      self.response.out.write(block)
    fh.close()

If we make it here, we have a page to send. To limit the memory footprint we read and send the file in 16 KByte blocks; slurping the entire file into memory is neither necessary nor beneficial.

application = webapp.WSGIApplication(
  [('.*', ForgiveDirectories)],
  debug=True)

def main():
  run_wsgi_app(application)

if __name__ == "__main__":
  main()

Boilerplate code to initialize the web services gateway and call into our function. It's possible to match regular expressions to dispatch to multiple different classes, but here we match everything and send it to ForgiveDirectories.
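The directory-forgiving decision can also be exercised without App Engine by pulling it into a plain function. This is a sketch under that assumption; resolve() and its return convention are hypothetical and not part of forgivedirectories.py.

```python
import os
import tempfile

def resolve(static_root, request_path):
    """Decide how to answer a request path.

    Returns ('redirect', url) or ('serve', file_path).
    """
    full_path = os.path.join(static_root, request_path.lstrip('/'))
    if not os.path.exists(full_path):
        return ('redirect', '/404.html')
    if os.path.isdir(full_path):
        if not request_path.endswith('/'):
            # Forgive the missing slash so relative links keep working.
            return ('redirect', request_path + '/')
        full_path = os.path.join(full_path, 'index.html')
    return ('serve', full_path)

# Build a throwaway static tree containing foo/index.html.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, 'foo'))
with open(os.path.join(root, 'foo', 'index.html'), 'w') as fh:
    fh.write('<html></html>')

print(resolve(root, '/foo'))      # redirect to '/foo/'
print(resolve(root, '/foo/'))     # serve foo/index.html
print(resolve(root, '/missing'))  # redirect to '/404.html'
```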

This works, sort of. You can see the web pages now. One problem is that the Content-Type is not being set, so browsers have to assume a default. Amusingly enough most browsers default to text/html, so at least the pages render even if images are broken. Google Chrome defaults to text/plain, so it shows the HTML source of the pages. So we obviously need to set the Content-Type, which we will do by examining the file's extension. Being new to Python, it took a few tries to get something reasonable.

  def contentTypeFromExt(self, extension):
    if extension == '.html':
      return 'text/html'
    elif extension == '.jpg':
      return 'image/jpeg'
    elif extension == '.gif':
      return 'image/gif'
    elif extension == '.png':
      return 'image/png'
    elif extension == '.js':
      return 'application/x-javascript'
    elif extension == '.css':
      return 'text/css'
    else:
      return 'text/plain'

The first attempt. It's basically C code, translated into Python.

  def contentTypeFromExt(self, ext):
    contenttype = {
      ext == '.html': 'text/html',
      ext == '.jpg' : 'image/jpeg',
      ext == '.gif' : 'image/gif',
      ext == '.png' : 'image/png',
      ext == '.js'  : 'application/x-javascript',
      ext == '.css' : 'text/css'}[1]
    return contenttype

The second attempt. Python does not have a switch statement, but some Google searching showed how to construct something that looks like one if you squint at it just right. Unfortunately this version doesn't supply 'text/plain' as a default, and it's just... weird. Trying to torture the code to vaguely resemble a switch statement is not very pleasing.

  contentExtensions = {
    '.html' : 'text/html',
    '.jpg'  : 'image/jpeg',
    '.gif'  : 'image/gif',
    '.png'  : 'image/png',
    '.js'   : 'application/x-javascript',
    '.css'  : 'text/css'}

  def contentTypeFromExt(self, ext):
    try:
      return self.contentExtensions[ext]
    except KeyError:
      return 'text/plain'

The third attempt, and the first real effort to do it in a Pythonic way. Python has hash tables as a fundamental type in the language, so let's use them! If the key is not found in the dictionary an exception will be thrown... so I guess we're supposed to catch it? I dunno. Obviously, not finding the key in the dictionary is expected to be an unusual occurrence. There must be a better way to do it.

  contentExtensions = {
    '.html' : 'text/html',
    '.jpg'  : 'image/jpeg',
    '.gif'  : 'image/gif',
    '.png'  : 'image/png',
    '.js'   : 'application/x-javascript',
    '.css'  : 'text/css'}

  def contentTypeFromExt(self, ext):
    return self.contentExtensions.get(ext,
                                      'text/plain')

The current version, which I'm pretty happy with. By calling the get method directly we can supply a default value to be returned, instead of having it throw an exception.
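A possible fifth variation, offered only as a sketch: the standard library's mimetypes module already carries an extension table, so the hand-written dictionary could be replaced by a lookup with the same 'text/plain' fallback. The function name below is hypothetical, and the exact type reported for some extensions (.js in particular) varies between Python versions and systems.

```python
import mimetypes

def content_type_from_path(path, default='text/plain'):
    """Guess a Content-Type from the file name, falling back to default."""
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or default

print(content_type_from_path('static/index.html'))
print(content_type_from_path('static/photo.png'))
print(content_type_from_path('static/notes.zzznotreal'))  # unknown, so the default
```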

Now we need to actually set the Content-Type header. I'm using os.path.splitext() to pull out the file extension, which is probably wrong: on Windows I think that method expects backslashes, where a URL will always be forward slashes. It is fine when used with the App Engine (which is not hosted on Windows), so I'll rely on it until I can figure out a more appropriate method to use.

  def contentTypeFromPath(self, path):
    (basename, extension) = os.path.splitext(path)
    return self.contentTypeFromExt(extension)
  
  # The following goes in the get() method
  # immediately before we open the fullPath file:
  self.response.headers['Content-Type'] = self.contentTypeFromPath(fullPath)
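On the splitext() worry: the standard library also ships posixpath, which can be imported on any platform and always treats the forward slash as the separator, so it matches URL paths regardless of the host operating system. A minimal sketch, with a hypothetical helper name; lowercasing the extension is an extra choice made here, not something the handler above does.

```python
import posixpath

def extension_from_url_path(url_path):
    """Return the lowercased extension of a URL path, or '' if there is none."""
    _base, ext = posixpath.splitext(url_path)
    return ext.lower()

print(extension_from_url_path('/foo/bar/index.html'))  # .html
print(extension_from_url_path('/photos.2009/trip'))    # empty: the dot is in a directory name
print(extension_from_url_path('/IMG.JPG'))             # .jpg
```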

Left To Be Implemented...

The App Engine is a lot of fun. It's a way to experiment with web services without expense for hosting. The forgivedirectories.py script is now in service, handling our web site. There are a few things left to do, which I hope to cover in future updates:

After reorganizing the web site some time ago, the old Apache server was configured to send 302 redirects for the old hierarchy. The new AppEngine handler should too, just in case there are any links remaining to the old URLs.
For efficiency we should allow the browser to cache the pages, by implementing If-Modified-Since support (sending a Last-Modified header and answering with 304 Not Modified when nothing has changed). The ipsojobs blog has a writeup about this which I intend to leverage.
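The browser-caching item comes down to comparing a file's modification time with the date the browser sends back. A rough sketch of that comparison, written against the Python 3 standard library rather than the App Engine runtime; both function names are hypothetical, and wiring them into the handler (setting Last-Modified, returning status 304) is left out.

```python
from datetime import timezone
from email.utils import formatdate, parsedate_to_datetime

def last_modified_header(mtime):
    """Format a file mtime (seconds since the epoch) as an HTTP date."""
    return formatdate(mtime, usegmt=True)

def is_not_modified(if_modified_since, mtime):
    """True when the browser's cached copy is still current (send a 304)."""
    if not if_modified_since:
        return False
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return False  # unparseable date: just send the page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop fractional seconds.
    return int(mtime) <= since.timestamp()

mtime = 1239278400  # example modification time, in seconds since the epoch
header = last_modified_header(mtime)
print(header)
print(is_not_modified(header, mtime))       # True: unchanged since the browser's copy
print(is_not_modified(header, mtime + 10))  # False: the file changed afterwards
```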