In the fall of 2013, Sprout Social engineers Aaron Zimmerman and Dave Huber undertook an ambitious endeavor: writing a new scheduled-message sending system to make Sprout’s platform even better. This system enables social media messages (Tweets, Facebook posts and more) to be sent out at a specific time in the future. Here, Zimmerman examines the architectural underpinnings of Sprout’s scheduling system and discusses why keeping things simple is so important.
Sprout’s original scheduler was built on ActiveMQ, a message broker that lets you enqueue an item with a delay before the item enters the queue. ActiveMQ was responsible for holding each message in flight until the appropriate time. While this implementation was awesomely simple, it had several critical defects:
- It kept multiple versions of the truth. Because a scheduled message is subject to revision or deletion, we had to keep a second copy of the message in a relational database. The queue consumer then had to query the database when the message entered the queue to ensure it was still valid.
- It was not resilient to failure. Because a scheduled message was a pending queue delivery item, the only thing that made it send at the appropriate time was a local KahaDB file on the ActiveMQ server. We had two servers, and the primary would rsync its data to the secondary, but this was far from truly resilient.
- It did not scale. We were using ActiveMQ as both a database and a scheduler, and it did not perform well under this particular load. Because people often schedule messages on round hours, 11 a.m., 3 p.m., etc. became hotspots of message delivery. These times would often result in delayed messages as the ActiveMQ server struggled to enqueue a burst of them.
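To make the delayed-enqueue model concrete: it behaves much like the JDK’s `DelayQueue`, where an item only becomes visible to consumers once its delay has elapsed. This is a stdlib analogy, not Sprout’s actual code, but it illustrates both the appealing simplicity and the fragility — the pending schedule lives entirely inside one process’s queue:

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public class DelayedSendDemo {
    // A message that becomes deliverable only after its delay expires.
    static class ScheduledMessage implements Delayed {
        final String body;
        final long sendAtMillis;

        ScheduledMessage(String body, long delayMillis) {
            this.body = body;
            this.sendAtMillis = System.currentTimeMillis() + delayMillis;
        }

        @Override public long getDelay(TimeUnit unit) {
            return unit.convert(sendAtMillis - System.currentTimeMillis(),
                                TimeUnit.MILLISECONDS);
        }

        @Override public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                                other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    public static String takeOne() {
        DelayQueue<ScheduledMessage> queue = new DelayQueue<>();
        queue.put(new ScheduledMessage("hello", 50)); // "deliver" ~50 ms from now
        try {
            // take() blocks until the delay elapses — the queue IS the schedule,
            // which is also why losing the queue loses the schedule.
            return queue.take().body;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```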
To solve the first problem, we decided to use a single database cluster to support both user interaction with scheduled messages and message delivery. To make this database highly available, we set up two MySQL servers. Any updates made against one automatically replicate to the other, meaning we can write to either of them and the data will be in both. Should either server fail or need to be rebooted, the overall system is not impaired at all.
Queue consumers (the applications that issue the API requests to send messages) are easily made resilient by standing up multiple instances on different servers. But what of the queue producers? How do database rows become items in the queue? It’s not as simple as running multiple instances, because we don’t want duplicates entering the queue. So we need to create different modes for the producers: they can run as either active or dormant, and we need a mechanism for toggling the dormant producer to active, should the active one fail. And if we allow multiple simultaneous producers, they need to coordinate so they don’t duplicate work.
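The active/dormant scheme might be sketched like this — a hypothetical illustration, not Sprout’s code, with illustrative names. Note the catch in the comment: a shared flag only works inside one process, which is precisely where the distributed-coordination rabbit hole begins:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class ProducerFailoverSketch {
    // Which producer instance is currently allowed to emit.
    static final AtomicReference<String> active = new AtomicReference<>("producer-A");
    static final List<String> queue = new ArrayList<>();

    static void produce(String producerId, String message) {
        // Only the active producer emits; the dormant one stays silent,
        // which is what prevents duplicate queue items.
        if (producerId.equals(active.get())) {
            queue.add(message);
        }
    }

    static void failover(String crashed, String standby) {
        // A "guard" flips the dormant producer to active when the active dies.
        // Within one JVM an AtomicReference suffices; across machines you need
        // distributed coordination — exactly the rabbit hole described above.
        active.compareAndSet(crashed, standby);
    }

    public static List<String> run() {
        produce("producer-A", "tweet-1"); // active: emitted
        produce("producer-B", "tweet-1"); // dormant: suppressed, no duplicate
        failover("producer-A", "producer-B");
        produce("producer-B", "tweet-2"); // newly active: emitted
        return queue;
    }
}
```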
We needed something to guard the scheduler. This guard would monitor the active producer and, should it crash, flip the dormant producer to active so there would be no downtime. But what if the guard itself were to crash? We use multiple guards. But which should act in the event of a crash? The guards themselves need an active/passive state as well. And what will flip the state of the passive guard, should the active one fail?
And so on. When you try to engineer for availability, it’s guards all the way down.
Our solution was to avoid as much architecture as we could. We found a not-quite-perfect-but-close-enough framework to do the hard, distributed things for us. We used Storm.
Storm is designed for distributed stream processing, and if you squint hard enough, scheduled messages look kind of like a stream. I won’t delve into the semantics of Storm, but we had multiple producers (spouts) running on a Storm cluster. These spouts would pass their payloads (the messages to send) to a queue consumer (a bolt), where each message would be sent and marked as delivered in the database. This producer-consumer pattern is built into Storm, and a lot of effort by really smart people has gone into making its message delivery reliable. So by using Storm, we could abstract away the bursty nature of the load, the need to buffer events and the actual means by which messages are delivered from one point to another. We could focus on the actual business logic of sending a Tweet.
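Stripped of Storm’s distribution and reliability machinery, the spout-to-bolt handoff is a producer–consumer pipeline. A minimal single-process sketch of that shape (illustrative names; in Storm, the queueing, threading, buffering and retries are the framework’s job):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SpoutBoltSketch {
    private static final String POISON = "POISON"; // sentinel to stop the consumer

    public static List<String> run() {
        BlockingQueue<String> stream = new ArrayBlockingQueue<>(16);
        List<String> delivered = new ArrayList<>();

        // "Spout": emits the messages that are due (here, a fixed batch).
        Thread spout = new Thread(() -> {
            for (String msg : List.of("tweet-1", "post-2")) {
                stream.add(msg);
            }
            stream.add(POISON);
        });

        // "Bolt": takes each message, "sends" it and marks it delivered.
        Thread bolt = new Thread(() -> {
            try {
                String msg;
                while (!(msg = stream.take()).equals(POISON)) {
                    delivered.add(msg + ":sent");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        spout.start();
        bolt.start();
        try {
            spout.join();
            bolt.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return delivered;
    }
}
```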
Operationally, Storm was a big win as well. Storm will restart a producer or consumer elsewhere if its host machine dies. Adding additional processing power to the application is as easy as changing a configuration setting that controls how many servers the topology runs on.
Storm has downsides as well. There isn’t any visibility into the internal queues between producers and consumers, and no way to see how many items are pending. There is no out-of-the-box way to deploy a topology with zero downtime. And Storm passes around tuples, which are untyped lists of objects; our tuples consist of a single object, a “ScheduledMessage” POJO, so to interact with that object we have to cast it. This is annoying at best and, more likely, a bit dangerous. These problems, and others, could have been mitigated with a more extensive, more perfect architecture.
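The casting annoyance looks roughly like this (a hypothetical stand-in POJO; the point is that a tuple’s values are plain `Object`s, so the compiler can’t help you):

```java
import java.util.List;

public class TupleCastDemo {
    // Hypothetical stand-in for the "ScheduledMessage" POJO the article mentions.
    static class ScheduledMessage {
        final String text;
        ScheduledMessage(String text) { this.text = text; }
    }

    public static String read() {
        // A tuple's values are essentially a list of Objects: the static type
        // information is gone by the time the consumer sees them.
        List<Object> tupleValues = List.of(new ScheduledMessage("hello"));

        // The consumer must cast and hope the producer emitted what it expects.
        // If a producer ever emits a different type, this fails only at
        // runtime with a ClassCastException — the "a bit dangerous" part.
        ScheduledMessage msg = (ScheduledMessage) tupleValues.get(0);
        return msg.text;
    }
}
```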
Our job as engineers is not to write perfect systems. Our job is to build things, to create value for the people who use our software. Our customers do not care about the architecture of distributed locks. While complexity, and architecting around complexity, is an undeniable aspect of engineering, it is our job—nay, our responsibility—to keep things as simple as they can functionally be. Storm wasn’t a perfect fit, and we had to compromise on things. But building it this way let us step away from the architectural rabbit holes and focus on the business logic.
In one sense, we moved from using one tool in a way it wasn’t intended to using a different tool in a way it wasn’t intended. But there is no such thing as perfect architecture; there is only better architecture. And the Storm-based implementation beats the delayed-message implementation on each of the three problems called out above.
An important prerequisite to keeping things simple is monitoring. We recognize up front that we aren’t going to catch every corner case that could throw a wrench in things, so we set up alerts to fire if something unexpected happens. Then we fix it.
It is far easier to fix problems in a simple system than it is to create a complicated system that doesn’t have problems.