2nd May 2023

How We Built a Resilient, Observable, and Scalable Notification System

Messari has a small engineering team. As such, we're always looking for ways to deliver features faster, better, and cheaper. Our governance product offering is quite young and badly needed a customer-facing notification system. This system would send emails to users with details of all off-chain and on-chain proposal events (e.g. created, voting started, voting ending soon, voting ended, proposal executed). Some of these emails would be immediately sent, and others would be sent in a digest (e.g. daily, weekly).

At the start of the project, Messari had two notification systems in use: one internal and one external. They were custom-built to very specific product requirements with minimal feature overlap. Moreover, they had vastly different philosophies and languages: one was event-driven and written in Python, and the other was cron-driven and written in Java and Golang. Lastly, they had different taxonomies, failure handling, unit test coverage, observability, and metrics. Given the new governance product requirements and the engineering goals of resiliency, observability, and scalability, reusing either system quickly proved unrealistic. Instead, we decided to build a new universal customer notification system that all our services would eventually use.

Resilient, Observable, and Scalable

A notification system should be resilient because important notifications cannot be lost or unduly delayed. It should also gracefully recover from system failures. To that end, the system is eventually consistent with an at-least-once design.

The system should provide comprehensive observability to gain insights into its internal operations and debug any issues that may arise. There should be monitoring to get a sense of what is normal, when the system is experiencing problems, and how much capacity is needed in the future. Moreover, troubleshooting issues should be relatively simple and straightforward.

Lastly, the system should be scalable. It should scale to natural user growth, usage spikes, and additional use cases.

The Approach

The first decision to make was whether the new notification system should be choreographed or orchestrated.

By default, many pick choreography because it is a well-known pattern that can be implemented with very well-known tools. A typical architecture would be a set of workers consuming and producing events via persistent queues. Additionally, it is a relatively simple system to stand up, and a functional prototype can be written quickly.

Orchestration, on the other hand, is a much less common solution because it is not a well-known pattern and cannot easily be implemented from first principles. Worse yet, the OSS orchestration systems are so complex that it is best to look for a PaaS provider, which adds an adoption hurdle.

We chose orchestration using Temporal via their PaaS offering.

Why orchestration?

Messari has a small engineering team. As such, we value visibility and maintainability. In a choreographed system, as product requirements become more complex, it becomes very difficult for one person to know the entire system. Debugging is time-consuming because of the high complexity of microservice interactions, and adding a new feature may require deep knowledge of the entire system. Overall, it isn't a good devX. As one teammate jokes, "I have PTSD from my last message-based project." In short, orchestration promises a simpler system that a small team can more easily own.

Why Temporal?

There are a few orchestration platforms (e.g. Netflix Conductor) to consider. Temporal was chosen for a few reasons. Firstly, it's easy to use and its key concepts are easy to understand. They have great documentation with plenty of code samples. The philosophy is very closely aligned with Golang, and Messari is a Golang shop. Additionally, there's a lively support forum and community Slack. Secondly, they have built it in such a way that visibility is a first-class citizen. They have out-of-the-box logging, metrics, and tracing. All workflow steps are saved in event histories with their inputs and outputs, and one can replay event histories for local debugging. Lastly, Temporal has a code-first approach. By default, there are no DSLs, YAMLs, JSONs, etc. All workflows and accompanying activities can be completely defined in code.

Execution

One of the unique features of the governance notification product is the ability to send digest emails on a weekly cadence on a day of the user's choosing. This feature was straightforward to implement using Temporal because Temporal provides powerful and simple building blocks.

At the crux of the implementation is a long-lived digest workflow. It waits until a scheduled time, then executes a few activities:

prepare digest email: this will call a user service to get user information (e.g. email address) and render the digest email
send email: this will call an external service to actually send the email
record attempt: regardless of the outcomes of the previous activities, this will record that an attempt was made to send an email

To complicate matters, the workflow can receive two types of data updates:

a new proposal event to send
a change to the desired digest day

Changing the digest day requires the workflow to change when the activities should run next.

To further increase visibility of running workflows, there are two queries that can retrieve information from a workflow:

query to return when the workflow will send the next email
query to return what the workflow will send in the next email

The above image is the Temporal UI for a weekly digest workflow. It describes the following chronological events:

A timer is started for 2d 13h 42m 54s. When that timer expires, a digest email will be sent.
There are 7 items added to the queue for the next email.
There is an update to the day of the week when the next email should be sent. A timer is started for 19h 27m 20s.
There are 11 items added to the queue for the next email.
The timer expires and a digest email is sent.
A timer starts for next week's digest email.

Overall, we're super happy with Temporal and Temporal Cloud. Temporal does the heavy lifting so that we can focus on the business logic and delivering resilient products at speed.

Future

With our new notification system up and running, we have several follow-up projects in mind:

extend support to other communication channels (e.g. Slack)
replace the old notification systems with this one
further improve visibility of the system through distributed tracing and more custom alerts
increase resiliency by porting more critical path code to Temporal workflows

If that sounds fun, why not join us on our next project?

Come work with us!

If you’re a software engineer interested in helping us contextualize and categorize the world’s crypto data, we’re hiring. Check out our open engineering positions to find out more.