Prefix Query Performance

Last month GitHub released a new feature enabling global search over all commit messages. One interesting aspect of this release is the ability to search for a commit by its SHA1 hash or just the first several characters of the SHA1. This post describes how we implemented this type of partial matching.

The two methods we have used in the past for this type of search are:

  • prefix queries
  • leading-edge ngram analysis

Leading-edge ngram analysis is performed at indexing time as commits are added to the search index. The commit SHA1 is broken into successively longer tokens anchored to the beginning of the SHA1 string. With this approach each 40-character SHA1 generates 40 tokens, one per prefix length. Queries are very fast, but the index grows as a result. It would be nice to avoid the index bloat, so let’s look at prefix queries instead.
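To make the token count concrete, here is a minimal Ruby sketch of the tokens that leading-edge ngram analysis would produce for one SHA1. In practice the search engine's analyzer does this at indexing time; the `leading_edge_ngrams` helper and the sample input string are hypothetical and only illustrate the idea.

```ruby
require "digest"

# Hypothetical helper: return every prefix of the string, shortest first.
# Each prefix is anchored to the beginning of the SHA1.
def leading_edge_ngrams(sha, min_length: 1)
  (min_length..sha.length).map { |len| sha[0, len] }
end

# Any SHA1 hex digest will do; this input is just a stand-in for a commit.
sha = Digest::SHA1.hexdigest("example commit")
tokens = leading_edge_ngrams(sha)

puts tokens.length          # one token per prefix length
puts tokens.first(3).inspect
puts tokens.last == sha     # the longest token is the full SHA1
```

Because every prefix is stored as its own token, a partial SHA1 typed by a user becomes an exact term lookup, which is where the speed comes from and also where the extra index size goes.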

Versioning ElasticSearch Indices

Working with search is a process. You create an index, build queries, and then realize they need to change. So how do you manage that process of change without driving everyone you work with insane?

Here is the approach we use at GitHub to update search indices on our developers’ machines (and also in our test environment). We want this process to be automatic and painless. Developers should not need to take special action when search indices are updated; it needs to “just work”.

Preforking Workers in Ruby

On the same day, two separate Rubyists asked me the very same question: “How do you communicate between the parent and a forked child worker?” This question needs a little background information to be properly understood.

Pre-forking is a UNIX idiom. When a process is expected to handle many tasks simultaneously, child processes can be created to offload the work from the parent process. Generally this makes the application more responsive; the child processes can use multiple CPUs and handle IO streams without blocking the parent. Eric Wong’s Unicorn web server uses child processes in this fashion. Ryan Tomayko has a fantastic blog post describing Unicorn and pre-forking child processes.
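One common way to let a parent talk to its forked children is a pair of pipes per worker, created before the fork so both sides inherit them. The sketch below is an assumption about how that might look, not the approach Unicorn uses; `spawn_worker` and the `job-N` message format are hypothetical names chosen for illustration, and `fork` requires a UNIX-like platform.

```ruby
# Hypothetical helper: fork one child connected to the parent by two pipes.
# `existing` holds workers forked earlier, whose pipe ends the new child
# inherits and must close so that EOF is delivered correctly later on.
def spawn_worker(existing)
  to_child_r, to_child_w = IO.pipe     # parent writes, child reads
  to_parent_r, to_parent_w = IO.pipe   # child writes, parent reads

  pid = fork do
    existing.each { |w| w[:writer].close; w[:reader].close }
    to_child_w.close
    to_parent_r.close
    to_parent_w.sync = true            # send replies without buffering

    # Handle one line per task until the parent closes its end of the pipe.
    while (line = to_child_r.gets)
      to_parent_w.puts("#{Process.pid} handled #{line.chomp}")
    end
    exit!(0)                           # skip at_exit handlers inherited from the parent
  end

  # The parent keeps only its own ends of the two pipes.
  to_child_r.close
  to_parent_w.close
  to_child_w.sync = true
  { pid: pid, writer: to_child_w, reader: to_parent_r }
end

workers = []
2.times { workers << spawn_worker(workers) }

# Send one task to each child and read back its reply.
replies = workers.each_with_index.map do |worker, i|
  worker[:writer].puts("job-#{i}")
  worker[:reader].gets.chomp
end

# Closing the write end gives the child EOF, which ends its loop.
workers.each do |worker|
  worker[:writer].close
  Process.wait(worker[:pid])
end

puts replies
```

Closing the unused pipe ends on both sides matters: a child only sees end-of-file once every copy of the write end has been closed, including copies inherited by sibling processes.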

Rolling Rails Log Files

There is a small issue with the default Rails logging setup. If left unchecked, the production log file can grow to fill all available space on the disk and cause the server to crash. The end result is a spectacular failure brought on by a minor oversight: Rails provides no mechanism to limit log file sizes.

Periodically the Rails log files must be cleaned up to prevent this from happening. One solution available to Linux users is the built-in logrotate program. Kevin Skoglund has written a blog post describing how to use logrotate for rotating Rails log files. The advantage of logrotate is that nothing needs to change in the Rails application in order to use it. The disadvantage is that the Rails app should be halted to safely rotate the logs.
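For comparison with logrotate, rotation can also happen inside the Ruby process itself. The sketch below uses the size-based rotation arguments of Ruby's standard library `Logger`; it is an illustration rather than the setup this post recommends. The temporary directory, the 1 KiB limit, and the file count of 3 are assumptions chosen so the effect is easy to see.

```ruby
require "logger"
require "tmpdir"

# A throwaway directory stands in for the application's log directory.
dir = Dir.mktmpdir
log_path = File.join(dir, "production.log")

# Keep at most 3 files in total, and start a new file whenever the
# current one would grow past 1024 bytes.
logger = Logger.new(log_path, 3, 1024)
200.times { |i| logger.info("request #{i} processed") }
logger.close

# Older files are renamed with a numeric suffix, such as production.log.0.
log_files = Dir.glob(File.join(dir, "production.log*")).sort
puts log_files.map { |path| File.basename(path) }.inspect
```

In a Rails application a logger built this way could be assigned to `config.logger`. The trade-off is the reverse of logrotate's: the app does not need to be halted, but the application configuration does have to change.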