The definition and measurement of service performance seeks to answer the following:
What is the target performance of each service in terms of availability, latency, throughput or other metrics?
How to differentiate the criticality of different services in terms of target performance?
How to differentiate between pre-event and event operations?
When are the support team available to respond to service issues?
How to categorise the impact of different types of service issues (incidents) and use that to drive fix times?
How to measure service performance?
How to communicate service issues depending on their impact?
Each of these questions is dealt with below. As a basic principle for major events the aim is to create a single standardised set of performance targets and corresponding definitions. In a mission-critical, multi-vendor environment that operates for a short period of time it helps to have everyone working towards the same goal and avoid sub-optimisation.
Service Levels (aka Performance Measures)
As a basis for defining service performance, Site Reliability Engineering (SRE), as initially developed by Google, provides a good framework to differentiate between:
things you measure and track that indicate service performance => Service Level Indicators
things you measure and track AND set internal targets for => Service Level Objectives
targets that are contractual and have consequences if not met => Service Level Agreements
Indicators of service performance
Availability, latency, error rate, throughput, packet loss, jitter, incident response time & resolution time, support ticket volumes
Targets for service performance
Availability > 99.9%
99% of requests served within 800ms
Packet Loss < 0.1%
Severity 2 incident resolution in less than 8 hours
Critical security patches deployed within 72 hours after release
Targets with Consequences if not met
A subset of contracted SLOs with consequences for non-achievement.
E.g. Availability > 99.9%
10% of total monthly fees due for the month in which the SLA is not achieved for the respective service
Service Level Objectives have the concept of an error budget, which is the allowable percentage, time or quality by which the target may be missed.
For example, if an application has a 99.9% availability target, measured on a monthly basis, then it can be unavailable for around 43 minutes in that month. A 30-minute outage at the start of the month will leave an error budget of around 13 minutes for the rest of the month. Tracking the error budget (good monitoring tools can do this for you) gives a sense of how much risk remains for implementing potentially risky changes.
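The error budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a feature of any monitoring tool: the function names `error_budget` and `remaining_budget` are hypothetical, and a 30-day month is assumed for the measurement period.

```python
from datetime import timedelta


def error_budget(slo_percent: float, period: timedelta) -> timedelta:
    """Downtime allowed within `period` for an availability target in percent."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    allowed_seconds = period.total_seconds() * (100 - slo_percent) / 100
    return timedelta(seconds=allowed_seconds)


def remaining_budget(slo_percent: float, period: timedelta,
                     downtime_so_far: timedelta) -> timedelta:
    """Budget left after the downtime already incurred (negative if overspent)."""
    return error_budget(slo_percent, period) - downtime_so_far


# Assumption: the monthly measurement period is 30 days long.
month = timedelta(days=30)

budget = error_budget(99.9, month)
left = remaining_budget(99.9, month, timedelta(minutes=30))
print(f"Budget at 99.9%: {budget.total_seconds() / 60:.1f} minutes")
print(f"Left after a 30-minute outage: {left.total_seconds() / 60:.1f} minutes")
print(f"Budget at 99.99%: {error_budget(99.99, month).total_seconds():.0f} seconds")
```

A negative result from `remaining_budget` means the target has already been missed for the period, which is a useful signal to postpone risky changes.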
More details on SRE Service Level Objectives can be found here, and Atlassian’s view of this here.
Defining Service Levels
Nobody wants their service to be unavailable or slow, but perfection is expensive and, in most cases, not noticeable by those using it. People use laptops or mobile phones that crash, they sometimes use slow internet connections or they can find another way to get something done if needed. So, aiming for 99.99% availability (4 minutes and 19 seconds of downtime per month) for every service is not needed.
While each service should be considered individually with the respective business owner, when starting out, the following service criticality can be used as a guide:
Breaks in service are intolerable and could be damaging from a reputational and financial perspective.
Impacts of a service outage may include:
Short breaks in service can be tolerated without reputational or financial impact.
Impacts of a service outage may include:
Contributing to efficient business operation but out of direct line of service to external customers.
Higher SLAs cost more. Achieving better SLA performance is, in most cases, a question of designing in resilience and redundancy. That adds complexity, which adds cost.
Higher SLAs can impact the rate of innovation. A low tolerance for service failures results in a very risk-averse approach to releasing new features. So the right balance needs to be found, and this may be time dependent.
For some services deployed at different sites or venues, the SLA may differ based on the site. For example, if a network at a venue has no switch redundancy, then the availability target cannot be the same as at a venue where there is redundancy.
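A rough way to see why redundancy changes the achievable target is to compare a single component with a redundant pair. The sketch below uses the standard parallel-availability formula and rests on simplifying assumptions: the units are identical, their failures are independent, and one working unit is enough to keep the service up. The function name `combined_availability` and the 99.9% per-switch figure are hypothetical.

```python
def combined_availability(component_availability: float, redundant_units: int) -> float:
    """Availability of identical units in parallel, assuming independent failures."""
    if redundant_units < 1:
        raise ValueError("at least one unit is required")
    unavailability = 1 - component_availability
    # The service is down only when every unit is down at the same time.
    return 1 - unavailability ** redundant_units


# Hypothetical figure: each switch is available 99.9% of the time.
single = combined_availability(0.999, 1)
paired = combined_availability(0.999, 2)
print(f"Single switch:  {single:.4%}")
print(f"Redundant pair: {paired:.4%}")
```

Real failures are rarely fully independent (shared power, shared configuration errors), so the figure for the redundant pair should be read as an upper bound rather than a prediction.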
The pre-event period requires less stringent SLAs, and business hours support on working days is typically sufficient for most services.
During the event operations period (aka the critical operations period), higher SLA targets are needed and most services are expected to operate 24 x 7.
These two periods are known as the critical operations period and the non-critical operations period.
Event operations period
BAU operations period
Other times – aka Business-As-Usual (BAU)
Note 1: The event operations period will start a few weeks prior to the day of the opening ceremony, depending on earlier events and key customer operations (e.g. broadcasters).
The management of incidents and major incidents is outlined in the Incident management process and Major Incident management process.
A mission/business critical technology service is not operational
A mission/business critical technology service is impaired, but operation can continue in a restricted manner
A mission/business critical service is at risk of being not operational but not currently impacted
A business operational technology service is not functioning
A component of a technology service is not operational and the delivery of the service is impaired but still functional; if untreated, it could become a higher severity incident
A small portion of users of a service are affected and cannot perform their required tasks
A single user is impacted in their use of a technology service or device and cannot perform their tasks
Incident severity is based on business impact. Each event can create corresponding examples for each severity level to assist with training and as a guide during operations.
Response & Resolution Targets
(BAU operations period)
(Event operations period)
Response Time - The elapsed time from when a support group is notified about a ticket to when an individual within that support group acknowledges receipt of the Incident or Service Request and accepts responsibility for it.
Resolution Time - The elapsed time from when a ticket for an Incident or Service Request is logged in the ITSM tool until it is resolved (i.e. fixed) permanently or via workaround.
With auto-assignment of tickets in the ITSM tool, there can be minimal delay in a ticket being assigned to the right support person.
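The two definitions above use different start points: response time runs from notification of the support group, while resolution time runs from when the ticket is logged. A minimal sketch of that distinction follows. The `Ticket` class and its field names are hypothetical and do not correspond to any particular ITSM tool's data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Ticket:
    logged_at: datetime        # ticket logged in the ITSM tool
    notified_at: datetime      # support group notified about the ticket
    acknowledged_at: datetime  # an individual accepts responsibility
    resolved_at: datetime      # fixed permanently or via workaround

    @property
    def response_time(self) -> timedelta:
        """Elapsed time from notification to acknowledgement."""
        return self.acknowledged_at - self.notified_at

    @property
    def resolution_time(self) -> timedelta:
        """Elapsed time from logging to resolution."""
        return self.resolved_at - self.logged_at


# Illustrative timestamps only.
ticket = Ticket(
    logged_at=datetime(2024, 7, 1, 9, 0),
    notified_at=datetime(2024, 7, 1, 9, 2),
    acknowledged_at=datetime(2024, 7, 1, 9, 12),
    resolved_at=datetime(2024, 7, 1, 13, 0),
)
print("Response time:", ticket.response_time)
print("Resolution time:", ticket.resolution_time)
```

With auto-assignment, `notified_at` sits very close to `logged_at`, so the two clocks start at almost the same moment.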
What is the achievement threshold?
The achievement threshold is the percentage of incidents or service requests within the category that must meet the resolution time target. For example, if there are 50 Severity 3 incidents within a measurement period and 45 of them meet the resolution time target while 5 do not, then the target has been met 90% of the time. This threshold provides an allowance for occasionally missing the resolution target. The lower the number of tickets, the lower the threshold should be, down to a minimum of, say, 80%.
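The calculation in the example above can be sketched as follows. The function names are hypothetical, and the thresholds passed in (95% and 80%) are illustrative values rather than recommended targets.

```python
def achievement_percent(met: int, total: int) -> float:
    """Percentage of tickets in a category that met the resolution time target."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= met <= total:
        raise ValueError("met must be between 0 and total")
    return 100 * met / total


def threshold_achieved(met: int, total: int, threshold_percent: float) -> bool:
    """True when the achieved percentage reaches the agreed threshold."""
    return achievement_percent(met, total) >= threshold_percent


# The example from the text: 45 of 50 Severity 3 incidents met the target.
achieved = achievement_percent(45, 50)
print(f"Achieved: {achieved:.0f}%")
print("Meets a 95% threshold:", threshold_achieved(45, 50, 95))
print("Meets an 80% threshold:", threshold_achieved(45, 50, 80))
```

With small ticket counts a single miss moves the percentage a long way (1 miss out of 5 is already 80%), which is the reason for lowering the threshold when volumes are low.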
Measuring Service Levels
Service levels are measured through the monitoring tool(s) and the ITSM tool. Each defined service level needs to be monitored, with alerting configured so that incidents can be resolved before the service exceeds its error budget.