The core premise of SRE is to treat operations as a software problem, where operations are concerned with the availability, scalability, latency and efficiency of Booking.com’s systems & services. At its core, the SRE role is tasked with engineering efforts to solve complex problems, which requires a strong aptitude for developing software systems that minimize human labor (i.e. through automation) and increase system & service reliability.
A Booking Reliability Engineering team has full vertical ownership of a system, from the server configuration up to the application interfaces. This gives the team full control over a service and avoids situations where different teams own different areas of a system and some parts fall between the cracks.
An SRE can wear several hats: at times an SRE is part of the product development team, and at other times acts as a consultant supporting a product development team in implementing Booking Reliability Engineering best practices. As systems & services grow in size and complexity, so too does the operational overhead. It is a fundamental principle of SRE to break this relationship between operational toil and system size and complexity. This also requires the team to limit operations work, thereby enforcing the engineering development efforts that are at the heart of Booking Reliability Engineering.
Ultimately, fundamental software engineering skills coupled with strong systems and networking knowledge will guide the SRE to create more reliable systems & services that are highly available, scale with growth, and are efficient and latency sensitive.
A Senior SRE has the additional responsibility of fostering an active and thriving SRE community, leading it by example as an advocate of engineering, reliability and security best practices.
Building software applications
- Is responsible to build software applications by using relevant development languages and applying knowledge of systems, services and tools appropriate for the business area and guide more junior members of the team in this topic.
- Is responsible to write readable and reusable code by applying standard patterns and using standard libraries and guide more junior members of the team in this topic.
- Is responsible to refactor and simplify code by introducing design patterns when necessary and guide more junior members of the team in this topic.
- Is responsible to ensure the quality of the application by following standard testing techniques and methods that adhere to the test strategy and guide more junior members of the team in this topic.
- Is responsible to maintain data security, integrity and quality by effectively following company standards and best practices and guide more junior members of the team in this topic.
End to End System Ownership
- Is responsible to own a service end to end by actively monitoring application health and performance, setting and monitoring relevant metrics and acting accordingly when they are violated, and guide more junior members of the team in this topic.
- Is responsible to reduce business continuity risks and bus factor by applying state-of-the-art practices and tools, and writing the appropriate documentation such as runbooks and OpDocs and guide more junior members of the team in this topic.
- Is responsible to reduce risk and obtain customer feedback by using continuous delivery and experimentation frameworks and guide more junior members of the team in this topic.
- Is responsible to independently manage an application or service by working through deployment and operations in production and guide more junior members of the team in this topic.
Software Systems Design
- Is responsible to evaluate possible architecture solutions by taking into account cost, business requirements, technology requirements and emerging technologies
- Is responsible to describe the implications of changing an existing system or adding a new system to a specific area, by having a broad, high-level understanding of the infrastructure and architecture of our systems
- Is responsible to help grow the business and/or accelerate software development by applying engineering techniques (e.g. prototyping, spiking and vendor evaluation) and standards
- Is responsible to meet business needs by designing solutions that meet current requirements and are adaptable for future enhancements
Technical Incident Management
- Is responsible to address and resolve live production issues by mitigating the customer impact within SLA and guide more junior members of the team in this topic.
- Is responsible to improve the overall reliability of systems by producing long term solutions through root cause analysis and guide more junior members of the team in this topic.
- Is responsible to keep track of incidents by contributing to postmortem processes and logging live issues and guide more junior members of the team in this topic.
- Has sufficient knowledge to coach, guide and improve the overall performance of stakeholders and colleagues at all levels, when appropriate, by sharing experience, knowledge and approaches to work
Automation and toil reduction
- Is responsible to ensure that infrastructure stays current by reducing technical debt, searching for bottlenecks and preparing for scaling and guide more junior members of the team in this topic.
- Is responsible to reduce cost of operations and maintenance by leveraging new technologies and automation, and partnering with vendors to ensure we stay current, and guide more junior members of the team in this topic.
- Is responsible to reduce human labour by writing small software features that address availability, scalability, latency and efficiency and guide more junior members of the team in this topic.
Monitoring and Alerting improvements
- Is responsible to review and verify performance of production systems and network infrastructure by continuously monitoring appropriate observability metrics and business KPIs and by planning capacity, and guide more junior members of the team in this topic.
- Is responsible to improve application reliability by partnering with development teams to advise on setting appropriate observability metrics and guide more junior members of the team in this topic.
- Has sufficient knowledge to advise product teams towards a technical solution that meets the functional, nonfunctional & architectural requirements by challenging the rationale for an application design and providing context in the wider architectural landscape
- Has sufficient knowledge to set a clear direction for a technical capability by evaluating and aligning the target architecture improvements, reframing architectural designs and decisions for varied stakeholders
- Is responsible to systematically identify patterns and underlying issues in complex situations, and to find solutions by applying logical and analytical thinking.
- Is responsible to constructively evaluate and develop ideas, plans and solutions by reviewing them, objectively taking into account external knowledge, initiating 'SMART' improvements and articulating their rationale.
Continuous Quality and Process Improvement
- Is responsible to identify opportunities for process, system and structural