Service Performance

Overview

The definition and measurement of service performance seeks to answer the following:

What is the target performance of each service in terms of availability, latency, throughput or other metrics?
How to differentiate the criticality of different services in terms of target performance?
How to differentiate between pre-event and event operations?
When are the support team available to respond to service issues?
How to categorise the impact of different types of service issues (incidents) and use that to drive fix times?
How to measure service performance?
How to communicate service issues depending on their impact?

Each of these questions is dealt with below. As a basic principle for major events, the aim is to create a single standardised set of performance targets and corresponding definitions. In a mission-critical, multi-vendor environment that operates for a short period of time, it helps to have everyone working towards the same goal and to avoid sub-optimisation.

Service Levels (aka Performance Measures)

As a basis for defining service performance, Site Reliability Engineering (SRE), as initially developed by Google, provides a good framework to differentiate between:

things you measure and track that indicate service performance => Service Level Indicators
things you measure and track AND set internal targets for => Service Level Objectives
targets that are contractual and have consequences if not met => Service Level Agreements

Type	Definition	Examples
SLI	Indicators of service performance	Availability, latency, error rate, throughput, packet loss, jitter, incident response time & resolution time, support ticket volumes
SLO	Targets for service performance	Availability > 99.9%; 99% of requests served within 800 ms; packet loss < 0.1%; Severity 2 incident resolution in less than 8 hours; critical security patches deployed within 72 hours of release
SLA	Targets with consequences if not met	A subset of contracted SLOs with consequences for non-achievement, e.g. Availability > 99.9%, with a penalty of 10% of the total monthly fees due for any month in which the SLA is not achieved for the respective service
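
As a rough illustration of how an SLI is compared against an SLO, the sketch below evaluates two of the example objectives from the table (availability > 99.9% and 99% of requests served within 800 ms) against a small, made-up sample of requests. The `Request` record, the function names and the choice to count only successful requests as "served" are assumptions made for illustration; in practice a monitoring tool would normally compute these figures.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    ok: bool            # True if the request succeeded
    latency_ms: float   # time taken to serve the request


def availability_sli(requests):
    """SLI: percentage of requests that succeeded."""
    return 100.0 * sum(r.ok for r in requests) / len(requests)


def latency_sli(requests, threshold_ms=800):
    """SLI: percentage of requests that succeeded within threshold_ms."""
    fast = sum(r.ok and r.latency_ms <= threshold_ms for r in requests)
    return 100.0 * fast / len(requests)


def slo_report(requests):
    """Compare each SLI against the example SLO targets from the table."""
    avail = availability_sli(requests)
    fast = latency_sli(requests)
    return {
        "availability_pct": avail,
        "availability_slo_met": avail > 99.9,   # target: > 99.9%
        "latency_pct_within_800ms": fast,
        "latency_slo_met": fast >= 99.0,        # target: 99% within 800 ms
    }


# Made-up sample: 1000 requests, of which 2 failed and 5 were slow
sample = (
    [Request(ok=True, latency_ms=120.0)] * 993
    + [Request(ok=True, latency_ms=950.0)] * 5
    + [Request(ok=False, latency_ms=30.0)] * 2
)
report = slo_report(sample)
print(report)
```

In this sample the latency objective is met while the availability objective is not, which shows why each SLI needs its own target rather than a single overall "health" figure.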

Error budgets

Service Level Objectives come with the concept of an error budget: the allowable percentage, time or quality shortfall by which the target may be missed.

For example, if an application has a 99.9% availability target, measured on a monthly basis, then it can be unavailable for around 43 minutes in that month. A 30-minute outage at the start of the month will leave an error budget of 13 minutes for the rest of the month. Tracking the error budget (good monitoring tools can do this for you) gives a sense of how much risk remains for implementing potentially risky changes.
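
The arithmetic behind this example can be sketched as follows. The function names are hypothetical, and a 30-day month is assumed; a 31-day month would give a slightly larger budget.

```python
def error_budget_minutes(slo_pct, period_days=30):
    """Total allowed downtime in minutes for an availability SLO over the period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100.0 - slo_pct) / 100.0


def remaining_budget_minutes(slo_pct, downtime_minutes, period_days=30):
    """Error budget left after the downtime already incurred in the period."""
    return error_budget_minutes(slo_pct, period_days) - downtime_minutes


budget = error_budget_minutes(99.9)          # 30-day month assumed
left = remaining_budget_minutes(99.9, 30)    # after a 30-minute outage
print(f"budget: {budget:.1f} min, remaining: {left:.1f} min")
# prints: budget: 43.2 min, remaining: 13.2 min
```

A negative remaining budget means the objective has already been missed for the period.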

More details on SRE Service Level Objectives can be found in Google's SRE material, and Atlassian has also published its view on the topic.

Defining Service Levels

Nobody wants their service to be unavailable or slow, but perfection is expensive and, in most cases, not noticeable by those using it. People use laptops or mobile phones that crash, they sometimes use slow internet connections or they can find another way to get something done if needed. So, aiming for 99.99% availability (4 minutes and 19 seconds of downtime per month) for every service is not needed.

While each service should be considered individually with the respective business owner, when starting out, the following service criticality can be used as a guide:

Type	Examples
Mission Critical	Breaks in service are intolerable and could be damaging from a reputational and financial perspective. Impacts of a service outage may include: significant reduction in revenue or potential revenue; negative media publicity if the outage is sustained or repetitive; damage to the company's commercial reputation and credibility; a critical safety issue.
Business Critical	Short breaks in service can be tolerated without reputational or financial impact. Impacts of a service outage may include: significant inconvenience or dissatisfaction for key customer groups; inability to collect revenue efficiently; significant impact on revenue generation, reputation and credibility if the outage is long-term.
Business Operational	Contributes to efficient business operation but is out of the direct line of service to external customers. Impacts of a service outage may include: reduced productivity of groups of business users, impacting their ability to work; reduced capability to deliver on departmental objectives.

Some considerations:

Higher SLAs cost more. Achieving better SLA performance is, in most cases, a question of designing in resilience and redundancy. That adds complexity, which adds cost.
Higher SLAs can impact the rate of innovation. A low tolerance for service failures results in a very risk-averse approach to releasing new features, so the right balance needs to be found, and this may be time dependent.
For some services deployed at different sites or venues, the SLA may differ based on the site. For example, if a network at a venue has no switch redundancy, then the availability target cannot be the same as at a venue where there is redundancy.

Operational Periods

The pre-event period requires less stringent SLAs, and business hours support during working days is typically sufficient for most services.

During the event operations period (aka the critical operations period) higher SLA targets are needed and most services are expected to operate 24 x 7.

These two periods are known as the non-critical operations period (pre-event, business as usual) and the critical operations period (event operations).

Type	When	What
Event operations period	Event duration¹; test events; tech rehearsals	24 x 7 service operations; higher SLA targets (availability, response & resolution, etc.)
BAU operations period	Other times, aka Business-As-Usual (BAU)	Business hours + on-call support for major incidents; standard SLA targets (availability, response & resolution, etc.)

Note 1: The event operations period will start a few weeks prior to the day of the opening ceremony, depending on earlier events and key customer operations (e.g. broadcasters).

Incident Severity

The management of incidents and major incidents is outlined in the Incident management process and Major Incident management process.

Severity	Description
1	A mission/business critical technology service is not operational
2	A mission/business critical technology service is impaired, but operation can continue in a restricted manner; or a mission/business critical service is at risk of becoming non-operational but is not currently impacted; or a business operational technology service is not functioning
3	A component of a technology service is not operational and delivery of the service is impaired but still functional (if untreated, this could become a higher severity incident); or a small portion of users of a service are affected and cannot perform their required tasks
4	A single user is impacted in their use of a technology service or device and cannot perform their tasks

Incident severity is based on business impact. Each event can create corresponding examples for each severity level to assist with training and as a guide during operations.

Response & Resolution Targets

Severity	Response Time	Resolution Time (BAU operations period)	Resolution Time (Event operations period)	Achievement Threshold
Severity 1	15 mins	4 hours	1 hour	90%
Severity 2
Severity 3
Severity 4
Service Request

Response Time - The elapsed time from when a support group is notified about a ticket to when an individual within that support group acknowledges receipt of the Incident or Service Request and accepts responsibility for it.

Resolution Time - The elapsed time from when a ticket for an Incident or Service Request is logged in the ITSM tool until it is resolved (i.e. fixed) permanently or via workaround.

With auto-assignment of tickets in the ITSM tool, there can be minimal delay in a ticket being assigned to the right support person.
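
The two definitions above can be sketched as simple timestamp differences checked against the Severity 1 targets (15 minutes response, 4 hours resolution in the BAU period, 1 hour in the event period). The ticket field names are hypothetical and would depend on the ITSM tool, and treating a time exactly equal to the target as "met" is an assumption. Other severities are left out because their targets are not filled in here.

```python
from datetime import datetime, timedelta

# Severity 1 targets only; targets for other severities are not defined here.
SEV1_RESPONSE_TARGET = timedelta(minutes=15)
SEV1_RESOLUTION_TARGET = {
    "bau": timedelta(hours=4),
    "event": timedelta(hours=1),
}


def response_time(notified_at, acknowledged_at):
    """Elapsed time from support group notification to acknowledgement."""
    return acknowledged_at - notified_at


def resolution_time(logged_at, resolved_at):
    """Elapsed time from ticket logging to a fix or workaround."""
    return resolved_at - logged_at


def sev1_targets_met(ticket, period):
    """Check a Severity 1 ticket (a dict of timestamps) against its targets."""
    responded = response_time(ticket["notified_at"], ticket["acknowledged_at"])
    resolved = resolution_time(ticket["logged_at"], ticket["resolved_at"])
    return {
        "response_met": responded <= SEV1_RESPONSE_TARGET,
        "resolution_met": resolved <= SEV1_RESOLUTION_TARGET[period],
    }


# Made-up ticket: acknowledged 10 minutes after notification,
# resolved 90 minutes after being logged
ticket = {
    "logged_at": datetime(2024, 1, 10, 9, 0),
    "notified_at": datetime(2024, 1, 10, 9, 2),
    "acknowledged_at": datetime(2024, 1, 10, 9, 12),
    "resolved_at": datetime(2024, 1, 10, 10, 30),
}
bau_result = sev1_targets_met(ticket, "bau")
event_result = sev1_targets_met(ticket, "event")
print(bau_result, event_result)
```

The same ticket meets the resolution target in the BAU period but misses it in the event period, which is the practical effect of the tighter event-time target.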

What is the achievement threshold?

The achievement threshold is the minimum percentage of incidents or service requests within a category that must meet the resolution time target. For example, if there are 50 Severity 3 incidents within a measurement period and 45 of them meet the resolution time target while 5 do not, then the target has been met 90% of the time. This threshold provides an allowance for occasionally missing the resolution target. The lower the number of tickets, the lower the threshold should be, down to a minimum of, say, 80%.
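
A minimal sketch of this calculation, using the 50-incident example above. The function names are hypothetical, and returning `None` when there are no tickets in the period (rather than treating the threshold as met or missed) is an assumption.

```python
def achievement_pct(resolution_met_flags):
    """Percentage of tickets that met their resolution time target."""
    if not resolution_met_flags:
        return None  # no tickets in the period, nothing to measure
    return 100.0 * sum(resolution_met_flags) / len(resolution_met_flags)


def threshold_achieved(resolution_met_flags, threshold_pct=90.0):
    """True/False if measurable, None if there were no tickets."""
    pct = achievement_pct(resolution_met_flags)
    if pct is None:
        return None
    return pct >= threshold_pct


# 50 Severity 3 incidents: 45 met the resolution target, 5 did not
flags = [True] * 45 + [False] * 5
pct = achievement_pct(flags)
met = threshold_achieved(flags)
print(pct, met)
# prints: 90.0 True
```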

Measuring Service Levels

Service levels are measured through the monitoring tool(s) and the ITSM tool. Each defined service level needs to be monitored, with alerting configured so that incidents can be resolved before the service exceeds its error budget.