Developer expectations for on-call
On-call is about more than just reacting. It's about setting standards, advocating for better practices, and taking ownership. Effective on-call engineers not only point out problems but also propose solutions and lead by example.
At TikTok, we use the owner-operator model rather than a "follow-the-sun" approach. Unlike the follow-the-sun model, where operational work is handed off between teams in different time zones, the owner-operator model means that you, as the developer, are also the operator. When issues arise, alerts come directly to you, regardless of the time.
However, on-call should address major issues, not just routine maintenance. If you're constantly dealing with emergencies outside regular hours, something might be broken in the system or the on-call process. Disrupted sleep is unsustainable, and no system should experience nightly disasters.
Disaster mitigation tips for on-call success
Here are some tips to make your on-call experience more manageable and effective:
Schedule deployment wisely
Most deployment issues surface within the first three hours after a rollout. When you choose a rollout time, account for that window so that it falls when you and your team are available to respond. Careful deployment scheduling minimizes the impact on people and reduces the risk of on-call emergencies.
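As a minimal sketch of this idea, the check below asks whether the risk window after a planned rollout stays inside working hours. The three-hour window comes from the text above; the 09:00 to 18:00 weekday schedule and all function names are assumptions made for illustration, not a description of any real tooling.

```python
from datetime import datetime, time, timedelta

# Assumptions for illustration: a three-hour risk window (from the text)
# and a hypothetical 09:00-18:00 weekday working day.
RISK_WINDOW = timedelta(hours=3)
WORK_START = time(9, 0)
WORK_END = time(18, 0)


def is_working_time(moment: datetime) -> bool:
    """Return True if the moment falls on a weekday within working hours."""
    return moment.weekday() < 5 and WORK_START <= moment.time() <= WORK_END


def risk_window_is_covered(deploy_at: datetime) -> bool:
    """Return True if the whole risk window after deploy_at is in working hours.

    Checking both endpoints is enough here because the window (3 hours) is
    shorter than the overnight gap between two working days (15 hours).
    """
    return is_working_time(deploy_at) and is_working_time(deploy_at + RISK_WINDOW)
```

With these assumptions, a rollout at 10:00 on a Tuesday passes the check, while one at 16:30 does not, because its risk window runs until 19:30.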
Optimize your alarms
False alarms are a common frustration. It's your responsibility as an owner to analyze thresholds and update the alarm system to reduce noise. A higher threshold might miss minor events, while a lower one can flood you with alerts. Find the right balance.
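One way to look for that balance is to replay past metric samples against candidate thresholds and count what each one would have gotten wrong. The sketch below assumes you have a labeled history of (metric value, was it a real incident) pairs; the function names and the `miss_cost` weighting are hypothetical choices, not part of any particular alarm system.

```python
from typing import Iterable, List, Tuple

# Each sample is (metric_value, was_real_incident), e.g. from past alarm history.
Sample = Tuple[float, bool]


def evaluate_threshold(samples: List[Sample], threshold: float) -> Tuple[int, int]:
    """Return (false_alarms, missed_incidents) for an alarm firing at value >= threshold."""
    false_alarms = sum(1 for value, real in samples if value >= threshold and not real)
    missed = sum(1 for value, real in samples if value < threshold and real)
    return false_alarms, missed


def pick_threshold(samples: List[Sample], candidates: Iterable[float],
                   miss_cost: float = 5.0) -> float:
    """Pick the candidate with the lowest weighted cost; ties go to the lower threshold.

    miss_cost is an assumed weight saying how much worse a missed incident
    is than one false alarm.
    """
    def cost(threshold: float) -> float:
        false_alarms, missed = evaluate_threshold(samples, threshold)
        return false_alarms + miss_cost * missed

    return min(sorted(candidates), key=cost)


history = [(55, False), (62, False), (71, False), (78, True), (85, True), (93, True)]
print(evaluate_threshold(history, 60))       # (2, 0): noisy but misses nothing
print(evaluate_threshold(history, 90))       # (0, 2): quiet but misses incidents
print(pick_threshold(history, [60, 75, 90]))  # 75
```

Weighting missed incidents more heavily than false alarms is a design choice; a team that is drowning in noise might reasonably lower `miss_cost`.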
Automate mitigation
Automating responses to predictable issues can prevent unnecessary wake-up calls. For example, if a server runs out of memory, a script can automatically restart it. Instead of waking up at 2 AM, you get a ticket to investigate at 9 AM. This approach turns emergencies into manageable tasks.
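The restart-then-ticket flow described above can be sketched as follows. The `restart` and `page_on_call` hooks are hypothetical callables standing in for whatever your infrastructure and paging tools provide, and the in-memory ticket list is only a placeholder for a real ticketing system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Mitigator:
    """Restart a server on an out-of-memory alert and defer follow-up to a ticket."""
    restart: Callable[[str], bool]        # hypothetical: returns True if the restart succeeded
    page_on_call: Callable[[str], None]   # hypothetical paging hook
    tickets: List[str] = field(default_factory=list)  # stand-in for a ticket queue

    def handle_out_of_memory(self, server: str) -> str:
        if self.restart(server):
            # Mitigated automatically: no page, just a ticket for working hours.
            self.tickets.append(f"Investigate out-of-memory restart on {server}")
            return "ticketed"
        # The automation failed, so a human is needed after all.
        self.page_on_call(f"Automatic restart failed on {server}")
        return "paged"
```

Paging only when the automated restart fails keeps the script from hiding a problem it cannot actually fix.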
Manage changes and operational costs
During peak traffic periods like Christmas Eve or Lunar New Year, code freezes are common. Systems rarely fail when left untouched. Minimize changes to reduce risk.
However, eliminating changes altogether isn't viable either. Instead, effective change management is necessary. At TikTok, we emphasize smaller merge requests (MRs), fast deployments, and careful rollback plans. Smaller, frequent changes are easier to manage and roll back if something goes wrong. This reduces the chance of catastrophic failures and allows for quick recovery.
Avoid big-bang changes: they touch more of the system, so any mistake can cause a large failure. Also, be mindful of when post-deployment issues typically arise. If you know that problems typically occur 2–5 hours after a deployment, avoid scheduling changes so that this window falls at a time when you can't respond. You don't want to be dealing with deployment issues or performing rollbacks when you're heading out for dinner.
Make use of continuous integration and continuous delivery (CI/CD) pipelines to address problems early in the software development lifecycle. The sooner you identify issues, the easier they are to solve.
Have a clear rollback plan
Always have a rollback plan and a description of the worst-case scenario. Understand which systems your change will affect and consider the largest potential impact of a failure. If a change fails, avoid relying on roll-forward fixes that require diagnosing issues under pressure, especially at night or when the team is unavailable. A clear, documented rollback plan is essential.
Handle multiple issues effectively
When on call, you may face multiple issues simultaneously. Effective triage and classification are crucial:
- Assess the impact of each alert. Prioritize based on risk and urgency.
- Regular updates are vital. Keep stakeholders informed with clear, concise updates on progress, focus, and next steps. For high-impact events, aim for updates every 15 minutes.
- Owner-operators are not just engineers; they are also leaders. Understand your systems deeply and push for process improvements to enhance your team’s ability to handle operational issues.
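The triage and update cadence above can be sketched in a few lines. The 1-to-5 impact and urgency scales, the impact-times-urgency score, and the 60-minute default interval are all assumptions for illustration; only the 15-minute cadence for high-impact events comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Alert:
    name: str
    impact: int   # hypothetical scale: 1 (low) to 5 (high)
    urgency: int  # hypothetical scale: 1 (low) to 5 (high)


def triage(alerts: List[Alert]) -> List[Alert]:
    """Order alerts by impact * urgency, highest first (ties keep input order)."""
    return sorted(alerts, key=lambda a: a.impact * a.urgency, reverse=True)


def next_update_due(last_update: datetime, high_impact: bool) -> datetime:
    """High-impact events get 15-minute updates; the 60-minute default is an assumption."""
    interval = timedelta(minutes=15 if high_impact else 60)
    return last_update + interval


alerts = [
    Alert("disk-usage", impact=2, urgency=2),
    Alert("checkout-errors", impact=5, urgency=5),
    Alert("slow-reports", impact=3, urgency=1),
]
print([a.name for a in triage(alerts)])
# ['checkout-errors', 'disk-usage', 'slow-reports']
```

A simple product of two scores is easy to reason about at 2 AM, though a real team may prefer explicit severity levels.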
Being a great on-call engineer requires a mix of technical skill, operational discipline, and leadership. By optimizing alarms, automating mitigation, managing deployments carefully, and always having a rollback plan, you can make on-call duty less stressful and more effective. Remember, on-call is about ownership, not just reaction. Embrace responsibility, and you'll excel!