The Site Reliability Engineer role suits experienced operators and engineers who combine software development and systems engineering to ensure resilient, performant production services. Candidates should apply if they are comfortable partnering with development teams, taking on-call responsibility and driving continuous improvement in availability and incident reduction.
Site Reliability Engineer Job Profile
The Site Reliability Engineer is responsible for maintaining and improving the reliability, availability and performance of critical services. This role balances proactive engineering work with operational duties to prevent incidents and reduce mean time to recovery when faults occur.
The purpose of the role is to embed reliability practices across the service lifecycle, implement automation to reduce toil, and collaborate with product and engineering teams to maintain defined service targets and operational readiness.
Site Reliability Engineer Job Description
Site Reliability Engineers design, build and maintain the systems and processes that support production services. They define and measure service-level indicators and objectives, develop automation to support deployment and recovery, and implement monitoring and alerting to detect degradation early. The role requires working across teams to embed reliability requirements into designs and deployments.
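As an illustration of what defining and measuring service-level indicators and objectives can involve, the following is a minimal sketch rather than a prescribed method. The function names, the request-based availability indicator and the 99% target in the usage note are assumptions chosen for the example; real services may define their indicators differently.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return the fraction of events that were served successfully."""
    if total_events == 0:
        # Assumption for this sketch: a window with no traffic counts as fully available.
        return 1.0
    return good_events / total_events


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Return the unspent share of the error budget (negative if overspent)."""
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        raise ValueError("slo_target must be below 1.0")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure
```

For example, a service that handled 999 of 1,000 requests successfully against a hypothetical 99% objective would have used roughly a tenth of its error budget for that window, leaving about 90% unspent.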
In day-to-day work, the role includes participating in incident response, conducting root cause analysis, and delivering remediation to prevent recurrence. The engineer is expected to document runbooks and operational procedures, improve observability and capacity planning, and support release management and change control to minimise risk to live services.
Site Reliability Engineer: Duties and Responsibilities
- Define, monitor and report on service-level indicators, objectives and agreements to measure reliability performance.
- Design and implement automation to reduce manual operational tasks and improve deployment consistency.
- Develop and maintain monitoring, logging and alerting to detect and communicate service issues proactively.
- Participate in on-call rotations to respond to incidents and restore services within agreed targets.
- Lead incident triage and post-incident reviews, producing actionable root cause analysis and remediation plans.
- Collaborate with development teams to introduce reliability requirements into architecture and release planning.
- Author and maintain runbooks, operational playbooks and run-time documentation for support teams.
- Perform capacity planning and performance tuning to ensure services meet current and projected demand.
- Manage change and release activities to minimise operational risk and ensure smooth deployments.
- Design and validate disaster recovery and business continuity procedures for critical systems.
- Conduct reliability-focused testing such as failure injection and chaos experiments to validate resilience.
- Identify and remediate security and configuration issues that could affect service availability.
- Drive continuous improvement initiatives to reduce incident frequency and shorten recovery times.
- Provide technical guidance and mentorship to engineering and operations colleagues on reliability best practice.
Site Reliability Engineer: Requirements and Qualifications
- Bachelor's degree in Computer Science, Engineering or a related technical discipline, or equivalent practical experience.
- Minimum of three years' experience in systems engineering, production operations or a related SRE role.
- Proven experience with automation and scripting to support deployment, testing and maintenance activities.
- Strong understanding of networking concepts, system architecture and distributed system behaviour.
- Demonstrable experience with monitoring, observability and incident management practices.
- Experience developing and using runbooks, incident playbooks and post-incident analysis techniques.
- Ability to diagnose performance issues and undertake capacity planning and performance optimisation.
- Familiarity with infrastructure as code, configuration management and continuous delivery principles.
- Clear written and verbal technical communication skills for cross-team collaboration and documentation.
- Ability to prioritise tasks, handle competing demands and work effectively under pressure during incidents.
- Comfortable participating in an on-call rota and supporting out-of-hours incident response when required.
- Commitment to continuous learning and staying current with reliability engineering practices and patterns.
