TikTok for Developers
A Developer's Guide to On-Call
by Kaiyuan Gan, Software Engineer, TikTok
Tech @ TikTok

Developer expectations for on-call

On-call is about more than just reacting. It's about setting standards, advocating for better practices, and taking ownership. Effective on-call engineers not only point out problems but also propose solutions and lead by example.

At TikTok, we use the owner-operator model rather than a "follow-the-sun" approach. Unlike the follow-the-sun model, where operational work is handed off between teams in different time zones, the owner-operator model means that you, as the developer, are also the operator. When issues arise, alerts come directly to you, regardless of the time.

However, on-call should be reserved for major issues, not routine maintenance. If you're constantly dealing with emergencies outside regular hours, something might be broken in the system or the on-call process. Disrupted sleep is unsustainable, and no system should experience nightly disasters.

Disaster mitigation tips for on-call success

Here are some tips to make your on-call experience more manageable and effective:

Schedule deployment wisely

Most deployment issues occur within the first three hours. When choosing a rollout time, make sure those first hours fall when you and your team are available to respond. Careful deployment scheduling can minimize the impact on people and reduce the risk of on-call emergencies.
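
As a rough illustration, a rollout time can be checked against a risk window before the deployment is scheduled. The sketch below assumes a three-hour window (the figure mentioned above) and hypothetical staffed hours of 9:00 to 18:00 on weekdays; the function name and constants are made up for this example and should be adjusted to your team.

```python
from datetime import datetime, time, timedelta

# Assumed values for illustration only; tune them to your team.
RISK_WINDOW = timedelta(hours=3)  # period after a rollout when issues tend to surface
WORK_START = time(9, 0)           # hypothetical start of staffed hours
WORK_END = time(18, 0)            # hypothetical end of staffed hours


def is_safe_deploy_time(start: datetime) -> bool:
    """Return True if the whole risk window falls inside staffed weekday hours."""
    end = start + RISK_WINDOW
    same_day = start.date() == end.date()
    is_weekday = start.weekday() < 5  # Monday=0 .. Friday=4
    return (
        is_weekday
        and same_day
        and start.time() >= WORK_START
        and end.time() <= WORK_END
    )
```

With these assumptions, a rollout at 10:00 on a Monday passes the check, while one at 16:30 does not, because its risk window runs past the end of the staffed day.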

Optimize your alarms

False alarms are a common frustration. It's your responsibility as an owner to analyze thresholds and update the alarm system to reduce noise. A higher threshold might miss minor events, while a lower one can flood you with alerts. Find the right balance.
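
One way to look for that balance is to replay candidate thresholds against historical data. The sketch below assumes you have past metric samples labeled with whether a real incident was in progress; the function names and the `miss_cost` weight are hypothetical and not part of any particular monitoring tool.

```python
from typing import Dict, Sequence


def evaluate_threshold(
    samples: Sequence[float],
    incident_flags: Sequence[bool],
    threshold: float,
) -> Dict[str, float]:
    """Count false alarms and missed incidents for one candidate threshold.

    samples[i] is a historical metric value; incident_flags[i] says whether
    a real incident was in progress at that time (labels are assumed to exist).
    """
    false_alarms = 0
    missed = 0
    for value, is_incident in zip(samples, incident_flags):
        fired = value >= threshold
        if fired and not is_incident:
            false_alarms += 1
        elif not fired and is_incident:
            missed += 1
    return {"threshold": threshold, "false_alarms": false_alarms, "missed": missed}


def pick_threshold(samples, incident_flags, candidates, miss_cost=5.0):
    """Pick the candidate with the lowest weighted cost.

    miss_cost is a hypothetical weight: how many false alarms
    one missed incident is considered to be worth.
    """
    def cost(result):
        return result["false_alarms"] + miss_cost * result["missed"]

    results = [evaluate_threshold(samples, incident_flags, t) for t in candidates]
    return min(results, key=cost)
```

Raising `miss_cost` pushes the choice toward lower thresholds that catch more events at the price of more noise; lowering it does the opposite.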

Automate mitigation

Automating responses to predictable issues can prevent unnecessary wake-up calls. For example, if a server runs out of memory, a script can automatically restart it. Instead of waking up at 2 AM, you get a ticket to investigate at 9 AM. This approach turns emergencies into manageable tasks.
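
The memory example above might be sketched as follows. This is a minimal illustration, not a production tool: `read_memory_percent` and `restart_service` are hypothetical hooks you would wire to your own monitoring and service manager, and the in-memory `tickets` list stands in for a real ticketing system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Ticket:
    summary: str
    details: str


@dataclass
class MemoryGuard:
    """Restart a service when memory crosses a limit and file a ticket for follow-up."""

    limit_percent: float                      # assumed threshold, e.g. 90.0
    read_memory_percent: Callable[[], float]  # hypothetical metric reader
    restart_service: Callable[[], None]       # hypothetical restart hook
    tickets: List[Ticket] = field(default_factory=list)  # stand-in for a ticket queue

    def check(self) -> bool:
        """Run one check. Return True if a mitigation was applied."""
        used = self.read_memory_percent()
        if used < self.limit_percent:
            return False
        # Mitigate first, then leave a record for daytime investigation.
        self.restart_service()
        self.tickets.append(
            Ticket(
                summary="Service auto-restarted after high memory usage",
                details=(
                    f"Memory at {used:.1f}% (limit {self.limit_percent:.1f}%). "
                    "Investigate the root cause during working hours."
                ),
            )
        )
        return True
```

The important design choice is that the mitigation always leaves a ticket behind, so the automatic restart hides the page but not the underlying problem.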

Manage changes and operational costs

During peak traffic periods like Christmas Eve or Lunar New Year, code freezes are common. Systems rarely fail when left untouched. Minimize changes to reduce risk.

However, eliminating changes altogether isn't viable either. Instead, effective change management is necessary. At TikTok, we emphasize smaller merge requests (MRs), fast deployments, and careful rollback plans. Smaller, frequent changes are easier to manage and roll back if something goes wrong. This reduces the chance of catastrophic failures and allows for quick recovery.

Avoid big-bang deployments: they touch more of the system at once, so any mistake can cause a large failure. Also, be mindful of when issues typically surface. If you know problems usually appear 2–5 hours after a deployment, avoid rolling out changes when that window would fall outside working hours. You don't want to be dealing with deployment issues or performing rollbacks when you're heading out for dinner.

Make use of continuous integration and continuous delivery (CI/CD) pipelines to catch problems early in the software development lifecycle. The sooner you identify an issue, the easier it is to fix.

Have a clear rollback plan

Always have a rollback plan and a description of the worst-case scenario. Understand which systems your change will affect and consider the largest potential impact of a failure. If a change fails, avoid relying on roll-forward fixes that require diagnosing issues under pressure, especially at night or when the team is unavailable. A clear, documented rollback plan is essential.
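
A lightweight way to enforce this is to treat the plan as a checklist that must be complete before a change ships. The sketch below is one possible shape for such a check; the `ChangePlan` class and its fields are hypothetical, not an existing tool or a TikTok process.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangePlan:
    """Hypothetical record describing a change before it is deployed."""

    title: str
    affected_systems: List[str] = field(default_factory=list)
    worst_case: str = ""
    rollback_steps: List[str] = field(default_factory=list)

    def missing_items(self) -> List[str]:
        """List what still has to be documented before the change ships."""
        missing = []
        if not self.affected_systems:
            missing.append("affected systems")
        if not self.worst_case.strip():
            missing.append("worst-case scenario")
        if not self.rollback_steps:
            missing.append("rollback steps")
        return missing
```

A deployment script or review template could refuse to proceed while `missing_items()` returns anything.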

Handle multiple issues effectively

When on call, you may face multiple issues simultaneously. Effective triage and classification are crucial:

  • Assess the impact of each alert. Prioritize based on risk and urgency.
  • Regular updates are vital. Keep stakeholders informed with clear, concise updates on progress, focus, and next steps. For high-impact events, aim for updates every 15 minutes.
  • Owner-operators are not just engineers; they are also leaders. Understand your systems deeply and push for process improvements to enhance your team’s ability to handle operational issues.
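
The triage and update cadence described above can be sketched in a few lines. The 1-to-5 impact and urgency scales, the multiplicative score, and the 60-minute fallback interval are assumptions made for this example; only the 15-minute cadence for high-impact events comes from the guidance above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Alert:
    name: str
    impact: int   # hypothetical scale: 1 (minor) to 5 (severe)
    urgency: int  # hypothetical scale: 1 (can wait) to 5 (act now)


def triage(alerts: List[Alert]) -> List[Alert]:
    """Order alerts so the highest impact-times-urgency score comes first."""
    return sorted(alerts, key=lambda a: a.impact * a.urgency, reverse=True)


def next_update_due(last_update: datetime, high_impact: bool) -> datetime:
    """Return when the next stakeholder update is due."""
    # 15 minutes for high-impact events follows the guidance above;
    # the 60-minute fallback is an assumption.
    interval = timedelta(minutes=15 if high_impact else 60)
    return last_update + interval
```

A simple score like this will not capture every situation, but it gives you a consistent first ordering that you can override with judgment.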

Being a great on-call engineer requires a mix of technical skill, operational discipline, and leadership. By optimizing alarms, automating mitigation, managing deployments carefully, and always having a rollback plan, you can make on-call duty less stressful and more effective. Remember, on-call is about ownership, not just reaction—embrace responsibility, and you'll excel!

