08 Mar 2014
Working with search is a process. You create an index and build queries and
realize they need to change. So how do you manage that process of change
without driving everyone you work with insane?
Here is the approach we use at GitHub to update search indices on our
developers’ machines (and also in our test environment). We want this process
to be automatic and painless. Developers should not need to take special
action when search indices are updated - it needs to “just work”.
Generating Versions
We start by assigning a version to each index. This is a SHA1 digest of the
settings and mappings for the index. A simple value v
is also mixed into
the digest. This value enables us to force an index to be recreated without
actually modifying the settings or mappings.
require 'digest/sha1'
require 'multi_json'

def version( settings, mappings, v = 0 )
  json = MultiJson.dump \
    v: v,
    s: settings,
    m: mappings
  sha = Digest::SHA1.hexdigest json
  sha.freeze
end
The settings and the mappings are both Hashes. Because a Hash in Ruby is
ordered, this version method will return the same SHA1 for the same inputs.
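For illustration, a call might look like the following (the sample settings
and mappings here are made up for this example, not our real index
definitions):

settings = { number_of_shards: 1, number_of_replicas: 0 }
mappings = { doc: { properties: { title: { type: "string" } } } }

version(settings, mappings)     # => a 40-character SHA1 hex string
version(settings, mappings)     # => the same SHA1 again for identical inputs
version(settings, mappings, 1)  # => a different SHA1, forcing a rebuild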
We store the settings and mappings as Ruby Hashes in source code files. We
treat these as the true values. If an index in ElasticSearch has different
settings or mappings, then it is considered out of date.
Storing Versions
Here is the clever bit. We can add any field to the index settings, and
ElasticSearch will store the field. Let’s store our version SHA1 in the index
settings when the index is created.
curl -XPUT 'http://localhost:9200/demo/' -d '{
  "settings": {
    "index._meta.version": "SHA1",
    "index": {
      ...
    }
  },
  "mappings": {
    ...
  }
}'
After we generate the version SHA1, it is added to the settings Hash under
the index._meta.version key. We decided on the _meta namespace to contain
metadata about the index. Other fields can be added to this namespace if the
need arises.
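In Ruby this amounts to merging the SHA1 into the settings Hash before the
index is created. The sketch below is illustrative only (it assumes the
elasticsearch-ruby client and reuses the sample settings, mappings, and
version method from above):

require 'elasticsearch'   # assumption: the elasticsearch-ruby client gem

client = Elasticsearch::Client.new(url: 'http://localhost:9200')

sha = version(settings, mappings)
versioned_settings = settings.merge('index._meta.version' => sha)

client.indices.create index: 'demo',
                      body: { settings: versioned_settings, mappings: mappings }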
Checking Versions
When a mapping is updated or changed in our source code, the corresponding
ElasticSearch index needs to be recreated and populated. Here is where the
version SHA1 pays off.
The changes in the mapping result in a new version SHA1. This will be
different from the SHA1 stored in ElasticSearch.
curl -XGET 'http://localhost:9200/demo/_settings'
When these two version SHA1s disagree, our bootstrap script knows to recreate
and repopulate the index. And all of this happens without any extra work
required from our developers.
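In outline, the check can be as small as the sketch below (illustrative only;
it reuses the client, sample settings and mappings, and version method from
the sketches above, and the exact nesting of the settings response may vary by
ElasticSearch version):

# Compare the stored version SHA1 with the one computed from source code.
response = client.indices.get_settings(index: 'demo')
stored   = response['demo']['settings']['index']['_meta']['version']
current  = version(settings, mappings)

if stored != current
  # Recreate the index with the new settings and mappings, then repopulate it.
  client.indices.delete(index: 'demo')
  client.indices.create index: 'demo',
                        body: { settings: settings.merge('index._meta.version' => current),
                                mappings: mappings }
  # ... reindex documents here ...
end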
Here’s to reducing friction!
Why Use a SHA1?
Wouldn’t a version number be simpler? Well yes … and no. We could put an
explicit version number in the settings Hash for each index. You could then
look at the source code and easily see that “this is version 42 of the index.”
The problem is remembering to change that version number each time you make a
change to the settings or mappings. Forcing a human to remember anything is
error prone - it increases friction.
With the SHA1 the whole versioning process is automated. There is nothing to
forget, and we can focus on improving search - not managing search.
Thanks for reading.
02 Jan 2011
On the same day, two separate rubyists asked me the very same question: “How
do you communicate between the parent and a forked child worker?” This
question needs a little background information to be properly understood.
Pre-forking is a UNIX idiom. When a process is expected to handle many tasks
simultaneously, child processes can be created to offload the work from the
parent process. Generally this makes the application more responsive; the
child processes can use multiple CPUs and handle IO streams without blocking
the parent. Eric Wong’s Unicorn web server
uses child processes in this fashion. Ryan Tomayko has a fantastic blog
post describing Unicorn and
pre-forking child processes.
Servolux provides a Prefork
class to manage a
pool of forked child processes. Internally the Prefork class uses a
pipe to send messages and status between the parent and child. This pipe
provides all kinds of niceties like heartbeat, child timeout, and rolling
restarts. However, this pipe cannot be used by the end user for general
communication with the child.
In fact, communication with a specific child process is usually not the
desired behavior.
Two of the major reasons for using child processes are (1) to take advantage of
multiple CPUs or (2) to handle IO intensive tasks. In both of these situations
each child process should be interchangeable with any other child process.
That is, we don’t care which child is handling a calculation or some IO
process; one child is as good as another.
So, the communication pipe used by the Prefork class is not really what
we’re after. It is used to manage a specific child. Instead we need a way to
send a message to the entire pool of child workers. Any currently available
child can handle the message.
Beanstalkd
The simplest method of communication with the child processes is via a message
queue. The Servolux gem provides some example
documentation
showing how to use a Beanstalkd message
queue to send jobs to the child processes. Each child establishes a connection
to the queue and waits for messages to process. The user pushes messages onto
the queue to be handled by some child.
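To give a flavor of that approach, here is a minimal sketch written for this
post using the beanstalk-client gem (it is not the example code that ships
with Servolux):

require 'rubygems'
require 'beanstalk-client'
require 'servolux'

# Each worker opens its own connection to the beanstalkd queue and handles
# jobs as they become available.
module QueueWorker
  def execute
    @beanstalk ||= Beanstalk::Pool.new(['127.0.0.1:11300'])
    # A real worker would avoid blocking forever here so the Prefork
    # heartbeat can still run; this is kept simple for illustration.
    job = @beanstalk.reserve
    $stderr.puts "child #$$ handling #{job.body}"
    job.delete
  end
end

pool = Servolux::Prefork.new(:module => QueueWorker)
pool.start 3

# The parent (or any other process) simply pushes messages onto the queue;
# whichever child reserves the job first will handle it.
producer = Beanstalk::Pool.new(['127.0.0.1:11300'])
producer.put 'http://example.com/'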
Sockets
A harder (but more educational) method is to use sockets for communication to
a child process. The following example is lifted mostly from Ryan Tomayko’s
blog post mentioned above, but with a smattering of Servolux
thrown in for good measure.
The key thing to take away here is that we create a UNIX server socket in the
parent process, and the forked children “accept” on this socket to receive
messages. As odd as it seems, the parent then creates a UNIX socket in order
to send messages to the children; the parent sends messages to the children
who are accepting.
require 'socket'
require 'open-uri'
require 'rubygems'
require 'servolux'
require 'nokogiri'

PATH = '/tmp/web_fetch.socket'

# Create a UNIX socket at the tmp PATH
# Runs once in the parent; all forked children inherit the socket's
# file descriptor.
$acceptor = UNIXServer.new(PATH)

# This module defines the process our forked workers will run. It listens on
# the socket and expects a single URL. It will then fetch this URL and parse
# the contents using nokogiri.
module WebFetch
  def execute
    if IO.select([$acceptor], nil, nil, 2)
      socket, addr = $acceptor.accept_nonblock
      url = socket.gets
      socket.close
      doc = Nokogiri::HTML(open(url)) { |config| config.noblanks.noent }
      $stderr.puts "child #$$ processed #{url}"
      $stderr.flush
    end
  rescue Errno::EAGAIN, Errno::ECONNABORTED, Errno::EPROTO, Errno::EINTR
  end

  def after_executing
    $acceptor.close
  end
end

# Spin up a pool of these workers
pool = Servolux::Prefork.new(:module => WebFetch)
pool.start 3

# 'urls.txt' is a simple text file with one URL per line
urls = File.readlines('urls.txt')

begin
  # Keep sending URLs to the workers until we have run out of URLs
  until urls.empty?
    client = UNIXSocket.open(PATH)
    client.puts urls.shift
    client.close
  end
rescue Errno::ECONNREFUSED
  retry
ensure
  # Give the workers time to complete their current task
  # and then stop the pool
  sleep 5
  pool.stop
  $acceptor.close
  File.unlink(PATH) if File.socket?(PATH)
end
Because each child has a copy of the UNIX server socket, each child also needs
to close this socket. This is done in the after_executing method. This
method is called just before the child process exits. Resource cleanup happens
here.
The parent process also needs to close the UNIX server socket and remove the
socket file created in the tmp folder. These final steps are performed in the
ensure block to ensure they happen.
Conclusion
The main concept to take away here is that pre-forked workers are
indistinguishable from one another. One child process is as good as another.
The vast majority of the time you will need to pass messages to the first
available child worker. If you find yourself needing to communicate with a
specific child then Prefork is most likely not the solution you are
looking for.
01 Dec 2010
There is a small issue with the default Rails logging setup. If left unchecked,
the production log file can grow to fill all available space on the disk and
cause the server to crash. The end result is a spectacular failure brought on
by a minor oversight: that Rails provides no mechanism to limit log file
sizes.
Periodically the Rails log files must be cleaned up to prevent this from
happening. One solution available to Linux users is the built-in logrotate
program. Kevin Skoglund has written a blog post describing how to use
logrotate for rotating Rails log files.
The advantage of logrotate is that nothing needs to change in the Rails
application in order to use it. The disadvantage is that the Rails app should
be halted to safely rotate the logs.
Another solution is to replace the default Rails logger with the Ruby
logging framework. The logging framework is
an implementation of the Apache
Log4j framework in Ruby. Although more
complex than using logrotate, the logging framework allows Rails
to roll its own log files without needing to halt the application. Another
advantage is that the logging destination (a file in this case) can be pointed
at a syslog daemon via a configuration file.
Configuring Rails
Rails provides for application configuration via the config/environment.rb
file; setting the Rails.logger
in this file is the normal way of specifying
an alternative logger to use. However, the logging framework needs to be
interposed much earlier in the Rails initialization process. This is
accomplished via the much loved/dreaded monkey-patch.
The following Ruby code should be saved in your Rails app as
lib/logging-rails.rb:
require 'logging'

class Rails::Initializer
  def load_environment_with_logging
    load_environment_without_logging

    _fn = File.join(configuration.root_path, 'config', 'logging.rb')
    return unless test(?f, _fn)

    _log_path = File.dirname(configuration.log_path)
    FileUtils.mkdir _log_path if !test(?e, _log_path)

    config = configuration
    eval(IO.read(_fn), binding, _fn)

    silence_warnings {
      Object.const_set("RAILS_DEFAULT_LOGGER", Logging::Logger[Rails])
    }
  end
  alias :load_environment_without_logging :load_environment
  alias :load_environment :load_environment_with_logging

  # Sets the logger for Active Record, Action Controller, and Action Mailer
  # (but only for those frameworks that are to be loaded). If the framework's
  # logger is already set, it is not changed.
  #
  def initialize_framework_logging
    frameworks = configuration.frameworks &
                 [:active_record, :action_controller, :action_mailer]

    frameworks.each do |framework|
      base = framework.to_s.camelize.constantize.const_get("Base")
      base.logger ||= Logging::Logger[base]
    end

    ActiveSupport::Dependencies.logger ||= Logging::Logger[ActiveSupport::Dependencies]
    Rails.cache.logger ||= Logging::Logger[Rails.cache]
  end
end
Include this file at the top of your config/environment.rb
file just after
the Rails bootstrap require line.
# Be sure to restart your server when you modify this file

# Specifies gem version of Rails to use when vendor/rails is not present
RAILS_GEM_VERSION = '2.3.8' unless defined? RAILS_GEM_VERSION

# Bootstrap the Rails environment, frameworks, and default configuration
require File.join(File.dirname(__FILE__), 'boot')
require 'logging-rails'

Rails::Initializer.run do |config|
  # Settings in config/environments/* take precedence over those
  # specified here. Application configuration should go into files
  # in config/initializers -- all .rb files in that directory are
  # automatically loaded.
  ...
end

Logging.show_configuration if Logging.logger[Rails].debug?
As a bonus, at the very end of the environment file is a line that will dump
the current logging configuration to STDERR when the Rails logger is set
to debug mode. This visual display of the configuration is very useful for
understanding where log messages will be sent.
Configuring Logging
Now that Rails has been cowed into submission, it is time to configure the
logging framework and enable rolling log files. A new configuration file has
been introduced by the lib/logging-rails.rb file. The config/logging.rb file
contains the base settings for the logging framework; these settings can be
overridden and refined in the environment-specific Rails configuration files.
Copy the following code into config/logging.rb:
# When an object is passed as the logging message, use the inspect method to
# convert it to a string. Other format options are :string and :yaml.
#
Logging.format_as :inspect

# Use a layout pattern for injecting a timestamp, log level, and classname
# into the generated log message.
#
layout = Logging.layouts.pattern(:pattern => '[%d] %-5l %c : %m\n')

=begin
Logging.appenders.stdout(
  'stdout',
  :auto_flushing => true,
  :layout => layout
)
=end

# Configure the rolling file appender. Roll the log file each day keeping 7
# days of log files. Do not truncate the log file when it is first opened.
# Flush all log messages immediately to disk (no buffering).
#
Logging.appenders.rolling_file(
  'logfile',
  :filename => config.log_path,
  :keep => 7,
  :age => 'daily',
  :truncate => false,
  :auto_flushing => true,
  :layout => layout
)

# Set the root logger to use the rolling file appender, and set the log level
# according to the Rails configuration.
#
Logging.logger.root.level = config.log_level
Logging.logger.root.appenders = %w[logfile]

# Under Phusion Passenger smart spawning, we need to reopen all IO streams
# after workers have forked.
#
# The rolling file appender uses shared file locks to ensure that only one
# process will roll the log file. Each process writing to the file must have
# its own open file descriptor for flock to function properly. Reopening the
# file descriptors after forking ensures that each worker has a unique file
# descriptor.
#
if defined?(PhusionPassenger)
  PhusionPassenger.on_event(:starting_worker_process) do |forked|
    Logging.reopen if forked
  end
end
This file is Ruby code that calls the configuration hooks provided by the
logging framework. There are many more
examples demonstrating
the various appenders and techniques to achieve the desired logging output. A
brief overview of what is happening is warranted, though.
The first line describes how objects will be formatted when passed as the log
message. The allowed values for format_as are :string, :inspect, and :yaml.
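For example, with :inspect a Hash passed as the log message is rendered via
its inspect method (a small illustration; the 'demo' logger name is
arbitrary):

log = Logging.logger['demo']
log.info(:user => 42, :action => 'login')
# logged as => {:user=>42, :action=>"login"}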
The second line defines how log messages will be formatted before being sent
to the appenders (appenders do the actual writing of the log message to the
logging destination). The PatternLayout
class is well documented. This layout will be assigned to the rolling file
appender.
Next comes the rolling file appender definition. Each appender is given a name
that can be used to refer to the appender later in the configuration. The
rolling file appender is configured to roll daily and keep the last 7 days of
log files. Setting the :truncate flag to true will cause the current log file
to be truncated when it is opened; usually this is not the desired behavior
when Rails starts. The :auto_flushing flag can either be true or a number;
when a number is used, that many log messages are buffered and then written to
disk in bulk. The configuration shown flushes each log message to disk as soon
as it is sent to the appender.
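If buffered writes are preferred, the same appender could be declared with a
numeric :auto_flushing value instead (illustrative; the other options mirror
the configuration above):

Logging.appenders.rolling_file(
  'logfile',
  :filename => config.log_path,
  :keep => 7,
  :age => 'daily',
  :truncate => false,
  :auto_flushing => 100,   # buffer 100 messages, then write them in bulk
  :layout => layout
)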
Finally comes the configuration of the root logger. Since the logging
framework is written in the spirit of Log4j, there is a hierarchy of logger
instances each with their own log level and possibly their own appender. The
rolling file appender is assigned to the root logger, and the log level is set
according to the Rails configuration.
Congratulations! Rails is now rolling its log files.
Bonus!
Now that the logging framework has been integrated into the Rails app,
other appenders can be used. Multiple appenders can be used simultaneously,
too.
At work, we are using the syslog appender to send all our log messages to a
central logging server. We then use Splunk to
extract information from the aggregated log files. That is another entire post
in itself. Please respond on Twitter if you are
interested in learning more.
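As a rough sketch (illustrative only; the program name and facility below are
placeholders, not our actual setup), a syslog appender can be declared in an
environment-specific configuration file and added alongside the rolling file
appender:

require 'syslog'

Logging.appenders.syslog(
  'syslog',
  :ident    => 'my_rails_app',         # placeholder program name
  :facility => ::Syslog::LOG_LOCAL7,   # placeholder syslog facility
  :layout   => layout
)

Logging.logger.root.add_appenders('syslog')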
Postscript
This approach to rolling Rails log files only works with Rails 2 projects.
Rails 3 has drastically changed the initialization process. A post is
forthcoming on how to integrate the logging framework with Rails 3; the
first step is digging into the new initialization process.