Since I joined the CloudForge Development team within CollabNet nearly a year ago, several of my colleagues and I have been focused primarily on improving the scalability and reliability of the CloudForge Subversion service. With tens of terabytes of live Subversion repository data being generated and used by tens of thousands of our customers, even the smallest improvements we make to the service have a pretty big impact.
CloudForge was built on a strong Subversion architecture, but over the years the ever-growing number of customers put a strain on the platform. In the original architecture, customers (organizations) were assigned to a shared server which was responsible for all Subversion-related activities. These shared servers used Apache HTTP Server name-based virtual hosts for each organization. In fact, each organization had several virtual hosts to cover the permutations of SSL/non-SSL, Subversion, ViewVC, etc. This approach actually works rather well in the general sense. The challenges we faced were not so much due to the correctness of this deployment approach as to the scale of it. Unfortunately, Apache has to hold the configuration for each and every one of those virtual hosts in memory. As you might imagine, configuration data for a few thousand virtual hosts can consume a great deal of system memory! Further, every time even one of those configurations needed to change, we’d have to perform a graceful restart of Apache in order to effect that change. Due to the amount of configuration data Apache was required to parse, these graceful restarts could take several minutes to complete. The sheer scale of our customer base simply forced us to reconsider how we managed our virtual hosts.
Re-imagining CloudForge Subversion Hosting
The solution to our Apache HTTP Server challenges was to employ dynamic virtual hosts. These are virtual hosts whose configuration and behavior are determined on-the-fly based on the hostname at which the incoming requests are aimed. Most of our virtual host configuration data was common across all our organizations anyway. All that remained was to dynamically calculate the parts that were unique to each organization. Now, I’d love to tell you that this was available as an out-of-the-box solution, but unfortunately that’s not the case. Apache HTTP Server does provide some support for dynamic virtual hosts, but this support is limited to very basic configuration details (such as the DocumentRoot). Dynamic virtual hosts are certainly not universally supported across all Apache modules.
So in order to make this work for us, we had to customize some things, essentially making anything that was organization-specific in our Apache configuration dynamically determinable. I began by patching Subversion’s mod_dav_svn module to handle dynamic determination of the SVNParentPath directory based on the request’s target hostname. I made a similar change to ViewVC’s configuration parser code, which allowed ViewVC to dynamically select a per-organization configuration file. Finally, I wrote a custom Apache authentication module for the Subversion and ViewVC servers to use. This, too, is able to dynamically select the authentication realm of interest and perform the authentication (and some authorization) checks required. With those pieces in place, we no longer required thousands of virtual host definition blocks in our Apache configuration. For the new architecture, we now have two such blocks. Yes, two. Total.
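To make the idea concrete, here is a minimal Python sketch of the kind of hostname-to-configuration lookup described above. It is not the actual mod_dav_svn patch or authentication module (those are Apache modules); the base domain, directory layout, file names, and realm format are all assumptions invented for illustration.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical values for illustration only; not CloudForge's real layout.
BASE_DOMAIN = "svn.example.com"
DATA_ROOT = PurePosixPath("/srv/orgs")

# An organization label: lowercase letters, digits, and inner hyphens.
_ORG_LABEL = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?")


@dataclass(frozen=True)
class OrgConfig:
    """The organization-specific values that would otherwise need a vhost block."""
    org: str
    svn_parent_path: PurePosixPath
    viewvc_conf: PurePosixPath
    auth_realm: str


def resolve_org_config(host_header: str) -> OrgConfig:
    """Derive per-organization settings from a request's Host header."""
    # Drop an optional port, a trailing dot, and normalize case.
    hostname = host_header.split(":", 1)[0].strip().lower().rstrip(".")
    suffix = "." + BASE_DOMAIN
    if not hostname.endswith(suffix):
        raise ValueError(f"unrecognized host: {host_header!r}")
    org = hostname[: -len(suffix)]
    # Reject anything that could escape the data root (dots, slashes, empty).
    if not _ORG_LABEL.fullmatch(org):
        raise ValueError(f"invalid organization name: {org!r}")
    org_root = DATA_ROOT / org
    return OrgConfig(
        org=org,
        svn_parent_path=org_root / "svn",
        viewvc_conf=org_root / "viewvc.conf",
        auth_realm=f"{org} Subversion",
    )
```

Under these assumptions, a request for acme.svn.example.com resolves to the parent path /srv/orgs/acme/svn without any per-organization entry in the server configuration. Validating the organization label before building paths matters here, because the hostname is supplied by the client.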
Meanwhile, my colleagues were busy developing a complete replacement for our old Perl-based Subversion configuration and management daemon. The new system is written in Ruby and is Resque-driven. Resque allows us to easily provision and add as many machines to our worker pool as we need so that jobs are performed in a timely manner. This is especially important for jobs associated with Subversion’s repository hook scripts, which must be queued and handled promptly so that the end-user’s Subversion client isn’t left waiting a long time for them to complete. Resque also allows us to dynamically assign jobs to different queues with customizable priorities. Thus certain job types can be prioritized more highly than others (for example, a blocking pre-commit hook job vs. an asynchronous post-commit hook job). Also, jobs associated with customers who pay for their CloudForge accounts can be given preferential treatment over those associated with free accounts.
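The prioritization described above can be sketched in a few lines. Resque expresses priority through the order in which a worker is told to check its queues, and the Python model below imitates that ordering in memory. It is an illustration only, not the production Ruby code: the queue names and the JobBroker class are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple

# Hypothetical queue names, listed from highest to lowest priority.
QUEUE_ORDER: List[str] = [
    "hooks_blocking_paid",
    "hooks_blocking_free",
    "hooks_async_paid",
    "hooks_async_free",
]


def queue_for(blocking: bool, paid: bool) -> str:
    """Pick a queue name from the job type and the customer's account tier."""
    kind = "blocking" if blocking else "async"
    tier = "paid" if paid else "free"
    return f"hooks_{kind}_{tier}"


class JobBroker:
    """In-memory stand-in for a set of named FIFO queues."""

    def __init__(self, queue_order: List[str]) -> None:
        self._order = list(queue_order)
        self._queues: Dict[str, Deque[str]] = {name: deque() for name in self._order}

    def enqueue(self, job: str, *, blocking: bool, paid: bool) -> None:
        """Append a job to the queue that matches its type and tier."""
        self._queues[queue_for(blocking, paid)].append(job)

    def reserve(self) -> Optional[Tuple[str, str]]:
        """Return (queue, job) from the first non-empty queue in priority order."""
        for name in self._order:
            queue = self._queues[name]
            if queue:
                return name, queue.popleft()
        return None  # Nothing to do right now.
```

In this model, a blocking pre-commit job for a free account is still picked up before an asynchronous post-commit job for a paying one, because job type is ranked ahead of account tier in QUEUE_ORDER. Reordering that list is all it takes to change the policy.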
Another architectural decision we made was to move all customer data to an enterprise-grade SAN storage system. This allowed us to introduce true clustering and load-balancing of our servers. No longer do we assign organizations to specific servers. Rather, we assign them to a cluster of servers. Should the average workload of our servers start to creep up, we can simply add more servers to the cluster without a single configuration change at the application level. Our networked storage solution also allows us to take instantaneous snapshots (with offline backup) of the data it holds, eliminating the need for more cumbersome (and less reliable) data backup solutions.
Launching the New Architecture
In March 2014, we finally “flipped the switch” and made our newly redesigned Subversion architecture the Subversion hosting framework for all new customers. Since then, over 8,000 new organizations have registered on CloudForge and begun using Subversion there. We’ve fine-tuned some things over the past couple of months, fixing little bugs and inefficiencies as you might expect.
Today, the service is humming along, operating as expected. Despite fielding some 300,000+ commits thus far and amassing nearly another half-terabyte of data, our handful of clustered Subversion servers and Resque workers are barely getting taxed at all. From an operational perspective, this is all very good news!
The new service brings improvements beyond mere operational stability, though. We’re now running the latest major release of the Apache Subversion server modules. We’ve introduced a more powerful user interface (plus performance and reliability improvements) to the Subversion repository data import feature. The new service also boasts an updated ViewVC version with its own user interface makeover.
What remains now is to migrate customers whose Subversion services still reside on the old architecture over to the new one. Remember those tens of terabytes of data I mentioned? They remain in use, sitting on servers which are not managed as part of this new Subversion offering. So we’ve been developing an automated migration strategy that will allow us to gradually upgrade customers who signed up prior to the launch of the new service. Our migration system is also Resque-based, with its own queue and priority levels and dedicated servers in the worker pool.
Right now, we’re slowly migrating organizations, one at a time, while we carefully monitor the process and manually verify the integrity of the results. When we are satisfied that all is well, we’ll begin mass migrations with automated integrity checks. If you’re one of our older customers whose data resides in the older system, you’ll eventually receive a notification that the migration of your organization’s Subversion service has begun. We’ve designed the migration to allow you to continue operating against your existing repositories without interruption for as long as possible, but at some point your repositories will be made read-only while we wrap up the last steps of your data migration. Then you’ll get a notification about the completion of your migration, and begin enjoying the fruits of our labor. For more information about the migration procedure, see https://cloudforge.zendesk.com/entries/41723020-Subversion-Service-Architecture-Migration.