In this post, TikTok Frontend Engineer Cheng Liu shares his team's contribution to the Rush.js project: Cobuilds. He dives into technical details including design, implementation, and testing. Read on to learn about the exciting challenges being solved by our engineers to enable TikTok's frontend build systems and infrastructure.
It's said that a good engineer can find a complex solution for an easy problem, and a better engineer can find a complex solution for a complex problem. But the best engineer finds a simple solution for a complex problem. That was my thought when I first read Elliot Nelson's design proposal on GitHub for Rush "cobuilds" (cooperative builds). After lots of hard work, we're proud to announce that TikTok has contributed this feature with the release of Rush 5.104.0! In this post, I'll share some insights about its development.
Some background: Rush is the open source tool that we use to build TikTok's web monorepo. A monorepo is a single Git repository containing multiple projects -- in our case, more than 500 TypeScript projects shared across ten business lines, comprising more than 200,000 source files. Recently, monorepos have gained popularity for web applications. As you might guess, a major challenge is build times. Rush already employs caching and parallel processes to optimize this. But as your repo scales up, eventually parallelism needs to be distributed across multiple virtual machines (VM's). The Rush project was started by Microsoft, who solved that problem by interfacing Rush with their open source build accelerator "BuildXL" (analogous to Google's Bazel). However, such systems tend to be complex to integrate, requiring a dedicated pool of VM's with its own workflows and management.
Elliot's "cobuilds" proposal suggested a surprisingly simple alternative: Rush already uses cloud storage to cache previously built projects. Restoring from the cache requires only a few seconds, whereas rebuilding that project might have taken several minutes. So, what if we simply launch two instances of the same job on two different VM's? If the first VM builds some projects and writes their output to the cache, then the second VM can skip those projects, and vice versa. In principle (and if your dependency graph is sufficiently parallelizable), this scheme should enable two VM's to build a given set of projects in half the time. Three VM's could reduce it to a third of the time. Because each VM is doing its own build of that pipeline, we call them "cobuilds."
There's a hitch however: What if VM #1 starts to build a project that is already in progress on VM #2? This will always be a cache miss. To avoid that, the cobuild feature needs a way for VM #1 to announce when it starts building a project. That way VM #2 will know to wait for the result to get cached, doing other useful work in the meantime. The proposal was to use a simple key/value store. Most companies are already running a service such as Redis or Memcached. The @rushstack/rush-redis-cobuild-plugin, an open public Rush plugin based on the official Redis implementation, has been released together with the cobuild feature. If you happen to use Redis, you could use the official plugin as a starting point. If not, you are able to create your own plugins in a few steps. (Our company uses a proprietary key/value service, so we included a Rush plugin API that makes it easy to implement your own key/value provider.)
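To make the announce-and-wait idea concrete, here is a minimal sketch. It assumes an in-memory map standing in for the shared key/value service, and it is synchronous for brevity (a real service such as Redis would be remote and asynchronous). The names `InMemoryKeyValueStore`, `tryAcquire`, `markDone`, and `decideNextAction` are hypothetical and are not the Rush plugin API.

```typescript
// Hypothetical in-memory stand-in for a shared key/value service such as Redis.
type BuildState = "building" | "done";

class InMemoryKeyValueStore {
  private readonly entries = new Map<string, { owner: string; state: BuildState }>();

  // Set-if-absent: only the first runner to claim a project gets the lock.
  tryAcquire(projectKey: string, runnerId: string): boolean {
    if (this.entries.has(projectKey)) {
      return false;
    }
    this.entries.set(projectKey, { owner: runnerId, state: "building" });
    return true;
  }

  // Called by the lock owner once its build output has been written to the cache.
  markDone(projectKey: string, runnerId: string): void {
    const entry = this.entries.get(projectKey);
    if (entry === undefined || entry.owner !== runnerId) {
      throw new Error(`${runnerId} does not hold the lock for ${projectKey}`);
    }
    entry.state = "done";
  }

  getState(projectKey: string): BuildState | undefined {
    return this.entries.get(projectKey)?.state;
  }
}

type NextAction = "build" | "wait" | "restore-from-cache";

// What a runner does when its scheduler reaches a project.
function decideNextAction(
  store: InMemoryKeyValueStore,
  projectKey: string,
  runnerId: string
): NextAction {
  if (store.tryAcquire(projectKey, runnerId)) {
    return "build"; // nobody else has announced this project yet
  }
  // Another runner announced it first: wait (doing other work) until it is cached.
  return store.getState(projectKey) === "done" ? "restore-from-cache" : "wait";
}
```

With a real Redis server, the set-if-absent step would typically map to a `SET` command with the `NX` option, so that two runners racing for the same project cannot both win.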
Of course, cobuilds are not as sophisticated as a full-fledged build accelerator. TikTok might still integrate Rush with Bazel or BuildXL one day if our monorepo grows far beyond 900 projects. Until then, cobuilds offer a cost-effective way to add distributed computation on top of your familiar pipelines from GitHub Actions, Jenkins, Circle CI, etc. You won't need to redo all your dashboards, workflows, and service integrations.
The next section assumes you're somewhat familiar with Rush. If not, the official docs already explain these topics very well:
Although the full cobuild architecture won't fit in a blog post, I'd like to share a few of the design challenges we encountered, since they are interesting to think about.
Handling failed cobuilds
Rush currently does not cache failed builds. In a typical setup, Rush's cloud cache is readable by any engineer's machine, but only written during controlled builds of the "main" branch in a continuous integration (CI) environment. Also, because a failed CI build will often succeed if tried again (due to transient errors), it's a reasonable policy to simply never cache failures.
Cobuilds bring new requirements, however. Consider a worst case scenario with 3 runners, and for example a large Jest test suite that takes 3 minutes to produce a failing result:
- extra1 acquires the lock for project A, spends 3 minutes building it, and fails.
- extra2 acquires the lock, spends 3 minutes building it, and fails.
- primary finally acquires the lock, spends 3 minutes building it, and fails.
In total, what should have been a 3-minute failure now takes 3 + 3 + 3 = 9 minutes.
It seems we should enable caching and retrieval of failing builds without affecting the established behavior of Rush's build cache. We solved it by introducing an environment variable called RUSH_COBUILD_CONTEXT_ID that is the same across every runner VM within a given cobuild run. When a build fails, the cobuild feature writes an empty cache entry using a special cache key that includes this context ID, designated as <cache_id>-<context_id>-failed. This ensures that failed builds do not conflict with successfully cached builds, but will be retried in future builds. To avoid the risk of accidentally caching using an empty key, if RUSH_COBUILD_CONTEXT_ID is missing, then the entire cobuild feature is disabled.
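The key scheme can be sketched in a few lines. This is only an illustration of the `<cache_id>-<context_id>-failed` pattern described above. The helper name `getFailedCacheKey` is hypothetical, and the actual key construction inside Rush may differ in its details.

```typescript
// Hypothetical helper illustrating the "<cache_id>-<context_id>-failed" pattern.
function getFailedCacheKey(cacheId: string, contextId: string | undefined): string | undefined {
  // Without a context ID the cobuild feature is disabled, so no key is produced.
  // This avoids ever writing an entry whose key has an empty context segment.
  if (contextId === undefined || contextId.trim() === "") {
    return undefined;
  }
  return `${cacheId}-${contextId}-failed`;
}

// Two runs of the same job with different context IDs never share a failure entry,
// while the key for a successful build (the plain cacheId) is unaffected.
const firstAttemptKey = getFailedCacheKey("abc123", "run-1");
const secondAttemptKey = getFailedCacheKey("abc123", "run-2");
```

Because the context ID is part of the key, a failure recorded by one cobuild run is visible to the other runners in that same run, but is invisible to any run that uses a different context ID.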
Earlier, we mentioned that CI builds sometimes fail due to transient errors. Imagine that a person goes to the GitHub website and clicks a button to "Re-run this job." Successfully built projects should be restored from the cache, but people expect that failed projects should be retried. To handle this, RUSH_COBUILD_CONTEXT_ID must be given a different value even for re-runs of the same "job," so that failure entries written by an earlier attempt are never matched. This also explains why we need to be careful when specifying this variable. While preparing the documentation for RUSH_COBUILD_CONTEXT_ID, I was surprised to find that popular CI products have very different meanings for basic terminology such as "build," "job," or "run."
Consolidating build logs
When Rush caches a project, it has always included a .log file with the stderr/stdout from the build task, but this is just for diagnostic purposes. When a project was restored from the cache, the old log output was not normally displayed.
These expectations change with cobuilds: If we have three separate VM runners, our CI system will show three separate build logs. Which website tab contains the log output for a given project? It is a random selection. We decided that it would be better for every VM to display the full log for every project, which means we need to print the .log file when restoring a project from the cache.
What about failed projects? As mentioned above, the cobuild feature creates an empty cache entry for failed projects, so we realized that entry must include the .log file. We call it a "log-only" cache entry.
Leaf project log-only mode
Over the past year, TikTok has been migrating hundreds of preexisting projects into our monorepo. These projects originated in other repos with disparate toolchains and practices, so we cannot safely enable the build cache for them without carefully reviewing the inputs/outputs for their individual toolchains. (If your company has a similar situation, we created a plugin called rush-audit-cache-plugin to facilitate this analysis -- give it a try!)
This put us in a somewhat unusual situation of having many projects that have disabled build caching. How can the cobuild orchestrator communicate the status of such projects? We solved it by reusing the same "log-only" cache entry described above.
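But there is a nuance: a log-only entry records a project's status and log output, not its build outputs. When a cached project has already been built by another VM, Rush can simply restore its outputs from the cache. But this isn't true for projects with a disabled cache. It seems Rush has to rebuild them on every VM. Or does it?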
Thinking more deeply, only one computer (the primary VM runner) needs to have the build outputs at the end of a cobuild run. And actually, if our purpose is only to validate code correctness (and not to produce an artifact for deployment), then we only need build outputs for projects that are dependencies of other projects. The "leaf" projects in the dependency graph don't need to be built at all, if Redis tells us that it was already built by another cobuild runner. These "leaf" projects are typically large apps, so the savings can be significant.
This insight inspired an environment variable, RUSH_COBUILD_LEAF_PROJECT_LOG_ONLY_ALLOWED=1. With this set, Rush allows the tasks of leaf projects (projects that no other project in the monorepo depends on) that have the build cache disabled to cache and restore their log files during a cobuild.
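As an illustration of what "leaf" means here, the following sketch finds the projects that no other project depends on. The `DependencyGraph` shape, the function name, and the example project names are assumptions made for this example. Rush computes its own project graph internally.

```typescript
// Maps each project name to the names of the local projects it depends on.
type DependencyGraph = Record<string, string[]>;

// A leaf project is one that no other project in the graph depends on.
function findLeafProjects(graph: DependencyGraph): string[] {
  const hasDependents = new Set<string>();
  for (const name of Object.keys(graph)) {
    for (const dependency of graph[name]) {
      hasDependents.add(dependency);
    }
  }
  return Object.keys(graph)
    .filter((name) => !hasDependents.has(name))
    .sort();
}

// Hypothetical monorepo: two apps sharing two libraries.
const exampleGraph: DependencyGraph = {
  "web-app": ["ui-components", "utils"],
  "admin-app": ["ui-components"],
  "ui-components": ["utils"],
  "utils": [],
};

const leafProjects = findLeafProjects(exampleGraph); // ["admin-app", "web-app"]
```

In this example only the two apps are leaves, which matches the observation that leaf projects are typically large applications rather than shared libraries.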
RUSH_COBUILD_LEAF_PROJECT_LOG_ONLY_ALLOWED=1 solves the problem for "leaf" projects in the dependency graph. But what about other projects that have build cache disabled?
Suppose project X disables the cache, and project Y depends on project X. Then we have two options:
- Project Y must be built by the same VM that built X;
- OR, Y must be rebuilt on each machine that needs it.
Option 1 does seem like the better choice. This led to another feature idea: the cobuild can cluster operations based on the dependency graph. Here's an example of two clusters:
- Cluster 1: The operation for A has the build cache enabled and no dependencies with the build cache disabled.
- Cluster 2: The operations for B and C are grouped into the same cluster and will be built on the same machine.
In this design, the Redis lock key is no longer tied to a specific project but rather to the cluster. In other words, our cobuild scheduling algorithm treats a cluster of projects as if it were one single project. This is why a cluster_id is included in the key in our final design. It ensures that operations within the same cluster share the same lock, preventing conflicts and ensuring proper synchronization. Note that for a monorepo where every project has caching enabled, the scheduling behavior will be identical to the behavior without this feature.
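One way to picture the clustering is as a union-find pass over the dependency graph. The sketch below assumes a simple rule for illustration: a project whose build cache is disabled is merged into the same cluster as every project that directly depends on it. The `ProjectInfo` shape and `computeClusters` are hypothetical, and Rush's actual clustering logic may differ.

```typescript
interface ProjectInfo {
  dependencies: string[]; // local projects this project depends on
  cacheEnabled: boolean; // false when the build cache is disabled for the project
}

// Groups projects into clusters so that a cache-disabled project and its
// direct dependents end up together (assumed rule, for illustration only).
function computeClusters(projects: Record<string, ProjectInfo>): string[][] {
  const names = Object.keys(projects).sort();
  const parent: Record<string, string> = {};
  for (const name of names) {
    parent[name] = name;
  }

  // Union-find "find" with path halving.
  const find = (start: string): string => {
    let current = start;
    while (parent[current] !== current) {
      parent[current] = parent[parent[current]];
      current = parent[current];
    }
    return current;
  };

  for (const name of names) {
    for (const dependency of projects[name].dependencies) {
      const dependencyInfo = projects[dependency];
      if (dependencyInfo !== undefined && !dependencyInfo.cacheEnabled) {
        parent[find(name)] = find(dependency); // merge the two clusters
      }
    }
  }

  const clusters: Record<string, string[]> = {};
  for (const name of names) {
    const root = find(name);
    if (clusters[root] === undefined) {
      clusters[root] = [];
    }
    clusters[root].push(name);
  }
  return Object.keys(clusters).map((root) => clusters[root]);
}

// A is cacheable and stands alone; B has caching disabled and C depends on B.
const exampleClusters = computeClusters({
  A: { dependencies: [], cacheEnabled: true },
  B: { dependencies: [], cacheEnabled: false },
  C: { dependencies: ["B"], cacheEnabled: true },
}); // [["A"], ["B", "C"]]
```

When every project has caching enabled, no merges happen and each project is its own cluster, which is consistent with the note above that scheduling is then unchanged.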
We said that the cobuild scheduling algorithm treats a cluster as if it were a single project, but Rush's task orchestrator still sees them as individual projects (operating in parallel within a single VM). This effectively means that a single Rush process may need to acquire the same lock multiple times, which required converting the algorithm to use a reentrant lock.
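For readers unfamiliar with the term, a reentrant lock lets its current owner acquire it again without blocking, and only frees it after a matching number of releases. Here is a minimal in-memory sketch of that behavior. The class name `ReentrantLockTable` is hypothetical, and how the cobuild feature implements reentrancy against the key/value store is not shown here.

```typescript
// A reentrant lock table keyed by cluster: the same owner may acquire a lock
// repeatedly, and it is freed only after a matching number of release() calls.
class ReentrantLockTable {
  private readonly locks = new Map<string, { owner: string; holdCount: number }>();

  acquire(lockKey: string, owner: string): boolean {
    const held = this.locks.get(lockKey);
    if (held === undefined) {
      this.locks.set(lockKey, { owner, holdCount: 1 });
      return true;
    }
    if (held.owner === owner) {
      held.holdCount += 1; // re-entry by the same Rush process
      return true;
    }
    return false; // held by a different runner
  }

  release(lockKey: string, owner: string): void {
    const held = this.locks.get(lockKey);
    if (held === undefined || held.owner !== owner) {
      throw new Error(`${owner} does not hold ${lockKey}`);
    }
    held.holdCount -= 1;
    if (held.holdCount === 0) {
      this.locks.delete(lockKey);
    }
  }
}
```

In the cluster scenario, one Rush process would call `acquire` once per operation in the cluster (for example, once for B and once for C), and other runners stay locked out until both operations have released.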
Testing the feature
After developing these features and working out all the necessary details, I had reached a significant milestone in my journey. However, before enabling them in our production monorepo, we needed to conduct some testing to ensure correctness and benchmarking to estimate the speed improvements.
For TikTok, our primary focus is to optimize the CI job that builds every project in the monorepo. Let's call this the "Build All" pipeline. An informal survey of Rush maintainers gave some timing expectations for the "Build All" pipeline in a TypeScript environment: 30 minutes is reasonable. 45 minutes is slow. Much more than an hour is unacceptable. This is the worst case of course -- most pull requests should only involve a subset of projects, and so their build times will be much faster.
“A principle of Rush is that the pain of build times should be proportional to the number of projects that could be broken by your changes.”
For cleaner benchmarking, I decided to create a simulated monorepo that mirrors the structure and relationships of the actual TikTok monorepo. I called it the "TikTok shadow monorepo". This replica encompasses over 700 projects, each with identical package names and corresponding relationships to their counterparts in the real monorepo. I also collected data from our real monorepo to develop a mock building script that accurately reproduces the build times for each project. This approach allowed me to simulate and analyze the impact of potential changes or enhancements without affecting our production environment.
For the key/value provider, I implemented a Rush plugin for accessing our in-house proprietary Redis-like service.
- Cobuild disabled: No distributed builds
- Cobuild enabled: with 2 extra instances for the same build job.
- Case 1: No cache hits
- Case 2: All cache hits
- Case 3: Some building processes will fail randomly.
- Key/value server disconnected
- Leaf project log-only mode
The results were as follows:
- Cobuild disabled: 1 hour and 28 mins.
- Cobuild enabled with 2 extra pipelines: 1 hour and 18 mins. As I mentioned before, in TikTok Monorepo, lots of projects have not enabled build cache.
- Cobuild enabled with ...: The power of distributed execution!
- Cobuild enabled with ...: The power of build cache!
- Cobuild enabled with ...: Consistent performance results with failing builds.
- Key/value server disconnected: When key/value server is unavailable, the Rush build fails and aborts immediately.
As evident, the leaf project log-only mode is vital to TikTok’s web monorepo, especially when there is low coverage of build cache configuration. With the activation of this unique mode and the addition of two extra pipelines, we can achieve a tremendous performance improvement of approximately 60%, equivalent to a reduction of 52 minutes.
Ideas for future improvements
Considering machine load
With these results, I was eager to try cobuilds in our production monorepo, but this was delayed by additional work to merge this feature into our fork of Rush. As members of the open source community started testing the prerelease in their monorepos, we received reports that when the cobuild run first starts, all the work is sometimes assigned to a single VM runner. This happens when roots of the dependency graph do not have enough parallelism to exceed Rush's --parallelism count, which determines how many Rush projects can build in parallel on a single computer. The default value is based on the number of CPU cores.
This was our intended design, but it assumes that every VM is equal, when in fact a machine under heavy load is less desirable than an idle machine. A workaround is to limit the number of parallel processes for Rush. For example, by specifying --parallelism=25% for a machine with 16 cores, Rush will be limited to 4 parallel processes on each VM.
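The arithmetic behind the workaround is simple enough to sketch. The function below is a hypothetical illustration, not Rush's parser: it assumes the result is rounded down with a minimum of one process, and it does not handle other values the real option may accept.

```typescript
// Hypothetical illustration of resolving a parallelism setting to a process count.
// Assumed rounding: floor, with a minimum of one process.
function resolveParallelism(setting: string, cpuCoreCount: number): number {
  const percentMatch = /^(\d+(?:\.\d+)?)%$/.exec(setting.trim());
  if (percentMatch !== null) {
    const fraction = parseFloat(percentMatch[1]) / 100;
    return Math.max(1, Math.floor(cpuCoreCount * fraction));
  }
  const count = parseInt(setting, 10);
  if (isNaN(count) || count < 1) {
    throw new Error(`Unsupported parallelism setting: ${setting}`);
  }
  return count;
}

const processesPerVm = resolveParallelism("25%", 16); // 16 * 0.25 = 4
```

With 4 processes per VM instead of 16, a single runner can no longer claim 16 root projects at once, so the other runners get a chance to acquire work when the cobuild starts.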
In the future, it should be fairly easy to use Redis to communicate machine load statistics, and I think we should pursue that. However, reportedly the 25% workaround already produces reasonably good behavior.
Task scheduler improvements
There are a couple other obvious opportunities to enhance Rush's strategy for task scheduling across machines:
- For clustering of operations, the algorithm could anticipate all the clusters within a cobuild run.
- The number of cobuild runner VMs is also an important variable to optimize. In my initial tests it was a fixed number, but obviously a job that builds a small number of projects should need fewer VMs. From my investigations of various CI products, it is straightforward for a script to specify the number of VMs dynamically. It could be a simple heuristic (using rush list --to git:origin/main for example), but I think Rush should provide a standard mechanism for calculating this.
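As a sketch of what such a heuristic might look like, the function below picks a runner count from the number of selected projects. Everything here is an assumption for illustration: the function name, the "projects per runner" ratio, and the cap are made up, and the project count would come from whatever your CI script derives from the project selection.

```typescript
// Hypothetical heuristic: one runner per `projectsPerRunner` selected projects,
// clamped between 1 and `maxRunners`. The thresholds are illustrative only.
function chooseRunnerCount(
  selectedProjectCount: number,
  projectsPerRunner: number,
  maxRunners: number
): number {
  const wanted = Math.ceil(selectedProjectCount / projectsPerRunner);
  return Math.min(maxRunners, Math.max(1, wanted));
}

const smallChange = chooseRunnerCount(30, 100, 3); // 1 runner is enough
const mediumChange = chooseRunnerCount(150, 100, 3); // 2 runners
const buildAll = chooseRunnerCount(700, 100, 3); // capped at 3 runners
```

A count-based rule ignores how long each project takes to build, so a production heuristic would probably want to weight projects by historical build time.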
Engineers often watch their build output in realtime to monitor progress and see whether errors have occurred. Rush has always provided realtime line-by-line output (cleverly "collated" by project, rather than interleaved). It looks like this:
Selected X operations:
==[ my_package_name (build) ]=====================================[ 1 of X ]==
This project was not found in the local build cache. Querying the cloud build cache.
This project was not found in the build cache.
Invoking: my-build my_package_name
Use predefined duration 730ms.
Caching build output folders: dist
Trying to find "tar" binary
Upload build cache to remote success.
Successfully set cache entry.
"my_package_name (build)" completed successfully in 1.06 seconds.
In a traditional build, all tasks are executed on a single machine, so obtaining realtime output is easy. But with the cobuild feature, much of the log output will be restored from the cache. This is no longer line-by-line, but instead comes in big chunks as each cache entry is restored. As the logs appear at a higher latency, it may give a psychological impression of sluggishness. However, the underlying computation runs at the same speed.
Rather than further complicating the cobuild design, we could tackle this separately, by creating a centralized dashboard for monitoring Rush progress. This would be a separate feature that would also provide a richer monitoring experience even without cobuilds.
Things often start out simple, but they tend to become complicated quickly. However, one of the beauties of computer science is the ability to control and manage this complexity effectively. Implementing the cobuild feature for Rush was indeed a challenging yet rewarding process for us. Not only did it enhance my development skills, but it also taught me invaluable lessons in problem-solving, patience, and resilience.