Last month GitHub released a new feature enabling global search over all
commit messages. One
interesting aspect of this release is the ability to search for a commit by its
SHA1 hash or just the first several characters of the SHA1. This post describes
how we implemented this type of partial matching.
The two methods we have used in the past for this type of search are prefix queries and leading-edge ngram analysis.
Leading-edge ngram analysis is performed at indexing time as commits are added
to the search index. The commit SHA1 is broken into subsequently longer tokens
anchored to the beginning of the SHA1 string. With this approach each SHA1
generates 40 tokens. The advantage is that queries are very fast at the expense
of an increase in index size. It would be nice to avoid the index bloat, so
let’s look at prefix queries instead.
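To make the two options concrete, here are rough sketches of each, assuming an Elasticsearch backend and the elasticsearch Ruby gem; the index name, field names, and analysis settings are illustrative assumptions, not GitHub's actual configuration.

```ruby
require "elasticsearch"

client = Elasticsearch::Client.new

# Option 1: leading-edge ngram analysis. The keyword tokenizer emits the
# whole SHA1 as a single token, and the edge_ngram filter expands it into
# one token per prefix length, so a 40-character SHA1 yields 40 tokens.
client.indices.create index: "commits", body: {
  settings: {
    analysis: {
      filter: {
        sha_edge_ngram: { type: "edge_ngram", min_gram: 1, max_gram: 40 }
      },
      analyzer: {
        sha_analyzer: { tokenizer: "keyword", filter: %w[lowercase sha_edge_ngram] }
      }
    }
  },
  mappings: {
    properties: {
      sha_ngram: { type: "text", analyzer: "sha_analyzer" },  # option 1
      sha:       { type: "keyword" }                          # option 2
    }
  }
}

# Option 2: a prefix query against the plain keyword field. Nothing extra is
# stored at index time; the prefix is expanded at query time instead.
client.search index: "commits", body: {
  query: { prefix: { sha: "deadbeef" } }
}
```

The trade-off described above falls out of this: the ngram option pays in index size (40 tokens per SHA1) for fast lookups, while the prefix query keeps the index small and pays at query time.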
Working with search is a process. You create an index, build queries, and
then realize they need to change. So how do you manage that process of change
without driving everyone you work with insane?
Here is the approach we use at GitHub to update search indices on our
developers’ machines (and also in our test environment). We want this process
to be automatic and painless. Developers should not need to take special
action when search indices are updated; it needs to “just work”.
On the same day, two separate rubyists asked me the very same question: “How
do you communicate between the parent and a forked child worker?” This
question needs a little background information to be properly understood.
Pre-forking is a UNIX idiom. When a process is expected to handle many tasks
simultaneously, child processes can be created to offload the work from the
parent process. Generally this makes the application more responsive; the
child processes can use multiple CPUs and handle IO streams without blocking
the parent. Eric Wong’s Unicorn web server
uses child processes in this fashion. Ryan Tomayko has a fantastic blog
post describing Unicorn and
pre-forking child processes.
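One common answer to their question is a pipe shared across the fork: the parent and child each close the end they do not use, and the child writes its results back through the pipe. The sketch below is a minimal illustration, not necessarily the approach this post goes on to describe.

```ruby
# Share a pipe across the fork so the child can report back to the parent.
reader, writer = IO.pipe

pid = fork do
  reader.close                  # the child only writes
  writer.puts "work finished in process #{Process.pid}"
  writer.close
end

writer.close                    # the parent only reads
puts reader.read                # blocks until the child closes its end
Process.wait(pid)
```

A pre-forking server like Unicorn builds on the same fork primitive, keeping a pool of long-lived workers rather than a single short-lived child.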
There is a small issue with the default Rails logging setup. If left unchecked,
the production log file can grow to fill all available space on the disk and
cause the server to crash. The end result is a spectacular failure brought on
by a minor oversight: Rails provides no mechanism to limit log file sizes.
Periodically the Rails log files must be cleaned up to prevent this from
happening. One solution available to Linux users is the built-in logrotate
program. Kevin Skoglund has written a blog post describing how to use
logrotate for rotating Rails log files.
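For reference, a typical logrotate stanza for a Rails production log looks something like the following; the path and option values here are illustrative and not taken from that post.

```
# Rotate the Rails production log once a week, keeping four compressed copies.
/var/www/myapp/log/production.log {
  weekly
  rotate 4
  compress
  delaycompress
  missingok
  notifempty
}
```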
The advantage of logrotate is that nothing needs to change in the Rails
application in order to use it. The disadvantage is that the Rails app must be
halted to rotate the logs safely.