Versioning ElasticSearch Indices

Working with search is a process. You create an index and build queries and realize they need to change. So how do you manage that process of change without driving everyone you work with insane?

Here is the approach we use at GitHub to update search indices on our developers’ machines (and also in our test environment). We want this process to be automatic and painless. Developers should not need to take special action when search indices are updated - it needs to “just work”.

Generating Versions

We start by assigning a version to each index. This is a SHA1 digest of the settings and mappings for the index. A simple value v is also mixed into the digest. This value enables us to force an index to be recreated without actually modifying the settings or mappings.

require 'digest/sha1'
require 'multi_json'

# Compute a deterministic version SHA1 for an index from its settings
# and mappings. Bump `v` to force a new version without changing either.
def version( settings, mappings, v = 0 )
  json = MultiJson.dump \
    v: v,
    s: settings,
    m: mappings

  sha = Digest::SHA1.hexdigest json
  sha.freeze
end

The settings and the mappings are both a Hash. Because Hashes in Ruby preserve insertion order, this version method will return the same SHA1 for the same inputs.

We store the settings and mappings as Ruby Hashes in source code files. We treat these as the true values. If an index in ElasticSearch has different settings or mappings, then it is considered out of date.

Storing Versions

Here is the clever bit. We can add any field to the index settings, and ElasticSearch will store the field. Let’s store our version SHA1 in the index settings when the index is created.

curl -XPUT 'http://localhost:9200/demo/' -d '{
  "settings": {
    "index._meta.version": "SHA1",
    "index": {
      ...
    }
  },
  "mappings": {
    ...
  }
}'

After we generate the version SHA1, it is added to the settings Hash under the index._meta.version key. We decided on the _meta namespace to contain metadata about the index. Other fields can be added to this namespace if the need arises.
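In Ruby this amounts to merging the SHA1 into the settings Hash before creating the index. Here is a minimal sketch, assuming the elasticsearch-ruby client and top-level SETTINGS and MAPPINGS Hashes (both names are illustrative):

# A sketch, not our exact bootstrap code: compute the version and
# store it in the settings used to create the index.
settings = SETTINGS.merge('index._meta.version' => version(SETTINGS, MAPPINGS))

client.indices.create \
  index: 'demo',
  body:  { settings: settings, mappings: MAPPINGS }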

Checking Versions

When a mapping is updated or changed in our source code, the corresponding ElasticSearch index needs to be recreated and populated. Here is where the version SHA1 pays off.

The changes in the mapping result in a new version SHA1. This will be different from the SHA1 stored in ElasticSearch.

curl -XGET 'http://localhost:9200/demo/_settings'

When these two version SHA1s disagree, our bootstrap script knows to recreate and repopulate the index. And all of this happens without any extra work required from our developers.
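The comparison itself is small. Here is a hedged sketch of the check, again assuming the elasticsearch-ruby client (the exact response shape for index settings varies across ElasticSearch versions, and recreate! is a hypothetical helper that deletes, recreates, and repopulates the index):

response = client.indices.get_settings(index: 'demo')
stored   = response.dig('demo', 'settings', 'index._meta.version')
current  = version(SETTINGS, MAPPINGS)

# Recreate and repopulate the index when the versions disagree.
recreate!('demo') if stored != current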

Here’s to reducing friction!

Why Use a SHA1?

Wouldn’t a version number be simpler? Well yes … and no. We could put an explicit version number in the settings Hash for each index. You could then look at the source code and easily see that “this is version 42 of the index.”

The problem is remembering to change that version number each time you make a change to the settings or mappings. Forcing a human to remember anything is error prone - it increases friction.

With the SHA1 the whole versioning process is automated. There is nothing to forget, and we can focus on improving search - not managing search.

Thanks for reading.

Preforking Workers in Ruby

On the same day, two separate rubyists asked me the very same question: “How do you communicate between the parent and a forked child worker?” This question needs a little background information to be properly understood.

Pre-forking is a UNIX idiom. When a process is expected to handle many tasks simultaneously, child processes can be created to offload the work from the parent process. Generally this makes the application more responsive; the child processes can use multiple CPUs and handle IO streams without blocking the parent. Eric Wong’s Unicorn web server uses child processes in this fashion. Ryan Tomayko has a fantastic blog post describing Unicorn and pre-forking child processes.

Servolux provides a Prefork class to manage a pool of forked child processes. Internally the Prefork class uses a pipe to send messages and status between the parent and child. This pipe provides all kinds of niceties like heartbeat, child timeout, and rolling restarts. However, this pipe is not available to the end user for general communication with the child.

In fact, communication with a specific child process is usually not the desired behavior.

Two of the major reasons for using child processes are (1) to take advantage of multiple CPUs and (2) to handle IO-intensive tasks. In both of these situations each child process should be interchangeable with any other child process. That is, we don’t care which child is handling a calculation or some IO process; one child is as good as another.

So, the communication pipe used by the Prefork class is not really what we’re after. It is used to manage a specific child. Instead we need a way to send a message to the entire pool of child workers. Any currently available child can handle the message.

Beanstalkd

The simplest method of communication with the child processes is via a message queue. The Servolux gem provides some example documentation showing how to use a Beanstalkd message queue to send jobs to the child processes. Each child establishes a connection to the queue and waits for messages to process. The user pushes messages onto the queue to be handled by some child.
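Here is a hedged sketch of that pattern using the beanstalk-client gem (the queue address and job handling are illustrative, not the Servolux example verbatim):

require 'rubygems'
require 'servolux'
require 'beanstalk-client'

# Each child connects to the queue and reserves jobs until stopped.
module JobProcessor
  def before_executing
    @queue = Beanstalk::Pool.new(['localhost:11300'])
  end

  def execute
    job = @queue.reserve(1)   # block up to 1 second waiting for a job
    $stderr.puts "child #$$ got job: #{job.body}"
    job.delete
  rescue Beanstalk::TimedOut
    # no job was available; Prefork will call execute again
  end

  def after_executing
    @queue.close
  end
end

pool = Servolux::Prefork.new(:module => JobProcessor)
pool.start 2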

Sockets

A harder (but more educational) method is to use sockets for communication to a child process. The following example is lifted mostly from Ryan Tomayko’s blog post mentioned above, but with a smattering of Servolux thrown in for good measure.

The key thing to take away here is that we create a UNIX server socket in the parent process, and the forked children “accept” on this socket to receive messages. As odd as it seems, the parent then opens client connections to this same socket in order to send messages; whichever child is accepting receives the message.

require 'socket'
require 'open-uri'

require 'rubygems'
require 'servolux'
require 'nokogiri'

PATH = '/tmp/web_fetch.socket'

# Create a UNIX socket at the tmp PATH
# Runs once in the parent; all forked children inherit the socket's
# file descriptor.
$acceptor = UNIXServer.new(PATH)

# This module defines the process our forked workers will run. It listens on
# the socket and expects a single URL. It will then fetch this URL and parse
# the contents using nokogiri.
module WebFetch
  def execute
    if IO.select([$acceptor], nil, nil, 2)
      socket = $acceptor.accept_nonblock
      url = socket.gets
      socket.close

      doc = Nokogiri::HTML(open(url)) { |config| config.noblanks.noent }
      $stderr.puts "child #$$ processed #{url}"
      $stderr.flush
    end
  rescue Errno::EAGAIN, Errno::ECONNABORTED, Errno::EPROTO, Errno::EINTR
  end

  def after_executing
    $acceptor.close
  end
end

# Spin up a pool of these workers
pool = Servolux::Prefork.new(:module => WebFetch)
pool.start 3

# 'urls.txt' is a simple text file with one URL per line
urls = File.readlines('urls.txt')

begin
  # Keep sending URLs to the workers until we run out of URLs
  until urls.empty?
    client = UNIXSocket.open(PATH)
    client.puts urls.shift
    client.close
  end

rescue Errno::ECONNREFUSED
  retry

ensure
  # Give the workers time to complete their current task
  # and then stop the pool
  sleep 5

  pool.stop
  $acceptor.close

  File.unlink(PATH) if File.socket?(PATH)
end

Because each child inherits a copy of the UNIX server socket, each child also needs to close this socket. This is done in the after_executing method, which is called just before the child process exits; resource cleanup happens here.

The parent process also needs to close the UNIX server socket and remove the socket file created in the tmp folder. These final steps are performed in the ensure block so they always run, even if an error interrupts the loop.

Conclusion

The main concept to take away here is that pre-forked workers are indistinguishable from one another. One child process is as good as another. The vast majority of the time you will need to pass messages to the first available child worker. If you find yourself needing to communicate with a specific child then Prefork is most likely not the solution you are looking for.

Rolling Rails Log Files

There is a small issue with the default Rails logging setup. If left unchecked, the production log file can grow to fill all available space on the disk and cause the server to crash. The end result is a spectacular failure brought on by a minor oversight: Rails provides no mechanism to limit log file sizes.

Periodically the Rails log files must be cleaned up to prevent this from happening. One solution available to Linux users is the built-in logrotate program. Kevin Skoglund has written a blog post describing how to use logrotate for rotating Rails log files. The advantage of logrotate is that nothing needs to change in the Rails application in order to use it. The disadvantage is that the Rails app should be halted to safely rotate the logs.
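For reference, a typical logrotate stanza looks something like this (the path and options are illustrative; copytruncate avoids restarting the app, at the cost of possibly losing a few log lines during the copy):

# /etc/logrotate.d/myapp -- the path and options are illustrative
/var/www/myapp/log/production.log {
  daily
  rotate 7
  compress
  missingok
  notifempty
  copytruncate
}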

Another solution is to replace the default Rails logger with the Ruby logging framework. The logging framework is an implementation of the Apache Log4j framework in Ruby. Although more complex than using logrotate, the logging framework allows Rails to roll its own log files without needing to halt the application. Another advantage is that the logging destination (a file in this case) can be pointed at a syslog daemon via a configuration file.

Configuring Rails

Rails provides for application configuration via the config/environment.rb file; setting the Rails.logger in this file is the normal way of specifying an alternative logger to use. However, the logging framework needs to be interposed much earlier in the Rails initialization process. This is accomplished via the much loved/dreaded monkey-patch.

The following Ruby code should be saved in your Rails app as lib/logging-rails.rb:

require 'logging'

class Rails::Initializer
  def load_environment_with_logging
    load_environment_without_logging

    _fn = File.join(configuration.root_path, 'config', 'logging.rb')
    return unless test(?f, _fn)

    _log_path = File.dirname(configuration.log_path)
    FileUtils.mkdir _log_path if !test(?e, _log_path)

    config = configuration
    eval(IO.read(_fn), binding, _fn)

    silence_warnings {
      Object.const_set("RAILS_DEFAULT_LOGGER", Logging::Logger[Rails])
    }
  end

  alias :load_environment_without_logging :load_environment
  alias :load_environment :load_environment_with_logging

  # Sets the logger for Active Record, Action Controller, and Action Mailer
  # (but only for those frameworks that are to be loaded). If the framework's
  # logger is already set, it is not changed.
  #
  def initialize_framework_logging
    frameworks = configuration.frameworks &
                 [:active_record, :action_controller, :action_mailer]

    frameworks.each do |framework|
      base = framework.to_s.camelize.constantize.const_get("Base")
      base.logger ||= Logging::Logger[base]
    end

    ActiveSupport::Dependencies.logger ||= Logging::Logger[ActiveSupport::Dependencies]
    Rails.cache.logger ||= Logging::Logger[Rails.cache]
  end
end

Include this file at the top of your config/environment.rb file just after the Rails bootstrap require line.

# Be sure to restart your server when you modify this file

# Specifies gem version of Rails to use when vendor/rails is not present
RAILS_GEM_VERSION = '2.3.8' unless defined? RAILS_GEM_VERSION

# Bootstrap the Rails environment, frameworks, and default configuration
require File.join(File.dirname(__FILE__), 'boot')
require 'logging-rails'

Rails::Initializer.run do |config|
  # Settings in config/environments/* take precedence over those
  # specified here. Application configuration should go into files
  # in config/initializers -- all .rb files in that directory are
  # automatically loaded.

  ...
end

Logging.show_configuration if Logging.logger[Rails].debug?

As a bonus, at the very end of the environment file is a line that will dump the current logging configuration to STDERR when the Rails logger is set to debug mode. This visual display of the configuration is very useful for understanding where log messages will be sent.

Configuring Logging

Now that Rails has been cowed into submission, it is time to configure the logging framework and enable rolling log files. The lib/logging-rails.rb file introduced above looks for a new configuration file: config/logging.rb. This file contains the base settings for the logging framework; these settings can be overridden and refined in the environment-specific Rails configuration files.

Copy the following code to config/logging.rb in your config folder:

# When an object is passed as the logging message, use the inspect method to
# convert it to a string. Other format options are :string and :yaml.
#
Logging.format_as :inspect

# Use a layout pattern for injecting a timestamp, log level, and classname
# into the generated log message.
#
layout = Logging.layouts.pattern(:pattern => "[%d] %-5l %c : %m\n")

=begin
Logging.appenders.stdout(
  'stdout',
  :auto_flushing => true,
  :layout => layout
)
=end

# Configure the rolling file appender. Roll the log file each day keeping 7
# days of log files. Do not truncate the log file when it is first opened.
# Flush all log messages immediately to disk (no buffering).
#
Logging.appenders.rolling_file(
  'logfile',
  :filename => config.log_path,
  :keep => 7,
  :age => 'daily',
  :truncate => false,
  :auto_flushing => true,
  :layout => layout
)

# Set the root logger to use the rolling file appender, and set the log level
# according to the Rails configuration.
#
Logging.logger.root.level = config.log_level
Logging.logger.root.appenders = %w[logfile]


# Under Phusion Passenger smart spawning, we need to reopen all IO streams
# after workers have forked.
#
# The rolling file appender uses shared file locks to ensure that only one
# process will roll the log file. Each process writing to the file must have
# its own open file descriptor for flock to function properly. Reopening the
# file descriptors after forking ensures that each worker has a unique file
# descriptor.
#
if defined?(PhusionPassenger)
  PhusionPassenger.on_event(:starting_worker_process) do |forked|
    Logging.reopen if forked
  end
end

This file is Ruby code that calls the configuration hooks provided by the logging framework. The logging framework documentation has many more examples demonstrating the various appenders and techniques to achieve the desired logging output. A brief overview of what is happening here is warranted, though.

The first line describes how objects will be formatted when passed as the log message. The allowed values for format_as are :string, :inspect, and :yaml.

The second line defines how log messages will be formatted before being sent to the appenders (appenders do the actual writing of the log message to the logging destination). The PatternLayout class is well documented. This layout will be assigned to the rolling file appender.

Next comes the rolling file appender definition. Each appender is given a name that can be used to refer to the appender later in the configuration. The rolling file appender is configured to roll daily and keep the last 7 days of log files. Setting the :truncate flag to true would cause the current log file to be truncated when it is opened; usually this is not the desired behavior when Rails starts. The :auto_flushing flag can either be true or a number; when a number is used, that many log messages are buffered and then written to disk in bulk. The configuration shown flushes each log message to disk as soon as it is sent to the appender.

Finally comes the configuration of the root logger. Since the logging framework is written in the spirit of Log4j, there is a hierarchy of logger instances, each with its own log level and possibly its own appenders. The rolling file appender is assigned to the root logger, and the log level is set according to the Rails configuration.
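These base settings can then be tuned per environment. A hedged sketch of an override that might live in config/environments/production.rb (the values are illustrative):

# Buffer 64 messages between disk writes instead of flushing each one,
# and only log warnings and above in production.
Logging.appenders['logfile'].auto_flushing = 64
Logging.logger.root.level = :warn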

Congratulations! Rails is now rolling its log files.

Bonus!

Now that the logging framework has been integrated into the Rails app, other appenders can be used. Multiple appenders can be used simultaneously, too.
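For example, a syslog appender can be defined and attached alongside the rolling file appender (a hedged sketch; the ident is illustrative):

# Send log messages to the local syslog daemon as well as the log file.
Logging.appenders.syslog(
  'syslog',
  :ident => 'my_rails_app'   # illustrative program name
)

Logging.logger.root.appenders = %w[logfile syslog]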

At work, we are using the syslog appender to send all our log messages to a central logging server. We then use Splunk to extract information from the aggregated log files. That is another entire post in itself. Please respond on Twitter if you are interested in learning more.

Postscript

This approach to rolling Rails log files only works with Rails 2 projects. Rails 3 has drastically changed the initialization process. A post is forthcoming on how to integrate the logging framework with Rails 3; the first step is digging into the new initialization process.