How to Improve Quality?

What to focus on?

A product engineer’s main purpose is to deliver value to customers in the form of features (code). Quality metrics should reflect this purpose.

There are 3 types of quality issues:

Incidents: When a user cannot use the system.
Production Bugs: When a user fails to use a feature.
Staging & Development Bugs: Errors or issues identified pre-production. They have no impact on the user.

There is a major distinction between the top 2 layers of the pyramid and the bottom layer.

The top 2 layers reflect “user pain”. The bottom layer reflects “development pain”. As a business we should mainly care about users, and engineering teams should align with business goals.

Product Engineering teams should track customer-facing bugs, that is:

Incidents
Production Bugs

Incidents & Production Bugs

reflect user pain
reviewed periodically (weekly/monthly)
reported regularly (see First Principles of Engineering Metrics)

Staging & Development Bugs

reflect “development pain”
used for optimization efforts (such as implementing CI/CD, TDD, PR review processes, etc.)
should NOT be reported

How to use metrics for non-customer facing bugs

Many bugs are created from:

Logs (Sentry etc.)
Staging testing
Code Review
Unit/integration/E2E tests
QA
…

The bugs generated from these channels might be as important as customer-facing ones. The goal, however, is to ensure that customers face fewer bugs. To accomplish this goal, we should ensure that bugs found in staging & development do not reach customers.

Just as a product team tracks a main product metric like DAU but also records every single button click, product engineering teams should track customer-facing bugs as their main metric and track non-customer-facing bugs the way product teams track button clicks.

A product manager doesn’t look at button-click analytics regularly. The same should apply to non-customer-facing bugs: we should only look at them when we think our internal processes are broken and need improvement.

What metrics to track?

When using metrics, we should pair two metrics that balance each other: one for efficiency and one for effectiveness.

Bug Resolution Time: How fast do we solve bugs for our users?

Bugs Created: How many bugs do our users encounter?

If we don’t track both metrics together, we’ll end up with weird scenarios like “We resolve a bug in 1 day on average, but have 1000 bugs created per month.” Dual metrics help us avoid scenarios like these.

In this case

Efficiency is Bug Resolution Time
Effectiveness is Bugs Created
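As a rough illustration, the two metrics can be computed side by side from a list of bug records. This is a minimal sketch, not a prescribed implementation: the sample dates and the `dual_metrics` and `week_key` helpers are hypothetical, and attributing resolution time to the week a bug was created is just one possible convention.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample data: (created, resolved) dates for customer-facing bugs.
# resolved is None while the bug is still open.
bugs = [
    (date(2024, 1, 1), date(2024, 1, 2)),
    (date(2024, 1, 3), date(2024, 1, 6)),
    (date(2024, 1, 9), date(2024, 1, 10)),
    (date(2024, 1, 10), None),
]


def week_key(day):
    """Return (ISO year, ISO week) so bugs can be bucketed per week."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


def dual_metrics(bugs):
    """Return Bugs Created and average Bug Resolution Time (days) per week."""
    created = defaultdict(int)
    lead_times = defaultdict(list)
    for created_on, resolved_on in bugs:
        week = week_key(created_on)
        created[week] += 1  # effectiveness: Bugs Created
        if resolved_on is not None:
            # efficiency: lead time from creation to completion
            lead_times[week].append((resolved_on - created_on).days)
    report = {}
    for week in sorted(created):
        days = lead_times[week]
        average = sum(days) / len(days) if days else None
        report[week] = {"bugs_created": created[week], "avg_resolution_days": average}
    return report


report = dual_metrics(bugs)
for week, values in report.items():
    print(week, values)
```

Reading both numbers for the same week makes it harder to celebrate a fast resolution time while the number of bugs keeps growing.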

Advanced tips:

Use a week or a month as the timeframe when tracking these metrics.
If you are constantly hiring, you may want to normalize the effectiveness metric as Bugs Created per Member.
The definition of a bug and of an incident should be clear. You can find more info in the “How should bugs be tracked?” section.
Use Issue Lead Time (time from issue creation to issue completion) when calculating Bug Resolution Time.
Some organizations use Bugs Resolved rather than Bugs Created. We suggest keeping it as “Bugs Created”.
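The normalization tip can be sketched in a few lines. The month labels, bug counts, and team sizes below are made-up illustration values, and `bugs_created_per_member` is a hypothetical helper name.

```python
def bugs_created_per_member(bugs_created, team_size):
    """Normalize Bugs Created by the number of engineers on the team."""
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    return bugs_created / team_size


# Hypothetical history: (month, bugs created, engineers on the team).
history = [("2024-01", 40, 10), ("2024-02", 48, 16)]

normalized = {
    month: bugs_created_per_member(created, members)
    for month, created, members in history
}
print(normalized)
```

In this made-up example the raw count rises from 40 to 48 while the per-member figure falls from 4.0 to 3.0, which is the kind of shift the raw metric hides in a growing team.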

How should bugs be tracked?

Incidents & Production Bugs can be tracked in multiple ways. I’ll go over the most common ways major organizations track them.

Note: Each organization has different needs and requirements. Feel free to adapt the options to fit your needs.

Note 2: The options below assume Jira. It’s possible to do something similar in any issue management platform.

Most teams track bugs incorrectly.

Tracking bugs correctly typically requires process changes.

Option #1: For Small Organizations

Each issue with type: bug is considered a quality fault.
Incidents are tracked as priority: highest bugs.
Production bugs are tracked as priority: normal/high bugs.
Staging & Development bugs are tracked as priority: low bugs.
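The mapping above can be expressed as a small lookup. This is a sketch under assumptions: issues are represented as plain dictionaries with `type` and `priority` keys (a hypothetical shape, not Jira's API format), and the priority names follow the list above rather than any particular Jira configuration.

```python
# Priority-to-category mapping from Option #1.
PRIORITY_TO_CATEGORY = {
    "highest": "incident",
    "high": "production_bug",
    "normal": "production_bug",
    "low": "staging_development_bug",
}

CUSTOMER_FACING = {"incident", "production_bug"}


def classify(issue):
    """Return the quality category for an issue, or None if it is not a bug."""
    if issue.get("type", "").lower() != "bug":
        return None
    # Unknown priorities also return None so they can be reviewed manually.
    return PRIORITY_TO_CATEGORY.get(issue.get("priority", "").lower())


def is_customer_facing(issue):
    """True for the categories that reflect user pain."""
    return classify(issue) in CUSTOMER_FACING


# Hypothetical issues for illustration.
outage = {"type": "Bug", "priority": "Highest"}
typo_in_staging = {"type": "Bug", "priority": "Low"}
print(classify(outage), is_customer_facing(typo_in_staging))
```

Because the category depends entirely on how people set the priority field, this option only works when the whole team applies the same definitions.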

Benefits

It’s simple
Better for small organizations

Cons

Requires alignment & education across the whole team

Option #2: For Big Organizations

All incidents SHOULD be created by incident management tooling (PagerDuty, Opsgenie, etc.), which creates an issue with type: Incident
All customer issues SHOULD be created through your support suite tooling (Zendesk, Intercom, etc.). The customer success team creates issues with type: Customer Bugs
Any internally caught bugs should be created as type: bug
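With this option the category follows from the issue type alone, so counting customer-facing bugs needs no priority conventions. The sketch below assumes issues are plain dictionaries with a `type` key (a hypothetical shape) and uses the three type names listed above.

```python
from collections import Counter

# Issue type decides the category; no priority conventions are needed.
TYPE_TO_CATEGORY = {
    "incident": "incident",
    "customer bugs": "production_bug",
    "bug": "internal_bug",
}


def count_by_category(issues):
    """Count issues per quality category, ignoring non-bug issue types."""
    counts = Counter()
    for issue in issues:
        category = TYPE_TO_CATEGORY.get(issue["type"].lower())
        if category is not None:
            counts[category] += 1
    return counts


# Hypothetical issues for illustration.
issues = [
    {"type": "Incident"},
    {"type": "Customer Bugs"},
    {"type": "Customer Bugs"},
    {"type": "bug"},
    {"type": "Story"},
]
counts = count_by_category(issues)
customer_facing = counts["incident"] + counts["production_bug"]
print(counts, customer_facing)
```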

Benefits

The system enforces correct tagging, which removes the need for alignment & education
Better for large organizations

Cons

It requires change management across DevOps, Product, Engineering, and CS teams

Tactical advice on improving quality

Improving anything goes through the same process:

Check metrics
Identify where to improve
Brainstorm on bets
Implement the bet
Check metrics to see whether the bet succeeded
Repeat

Every action we take should either:

Decrease the time it takes us to fix a bug (Bug Resolution Time)
Decrease how many bugs we release to customers (Bugs Created)

1. Check Metrics

You should have a board like the following, where you track Quality as part of your operational metrics.

2. Identify where to improve

At this point you have 2 options:

Debug a single issue

If you click on any graph, you’ll see all the issues from that week.

Debug high level patterns

Click on the graph and group by the field whose patterns you’d like to understand.

The most common group-by options are:

Team
Priority
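If your dashboard does not offer grouping, the same view can be approximated on exported issues. This is a minimal sketch assuming each issue is a dictionary with `team` and `priority` keys; the field names, team names, and `group_counts` helper are hypothetical.

```python
from collections import Counter


def group_counts(issues, field):
    """Count issues per value of the given field, e.g. 'team' or 'priority'."""
    return Counter(issue.get(field, "unknown") for issue in issues)


# Hypothetical export of one week's customer-facing bugs.
issues = [
    {"team": "Payments", "priority": "High"},
    {"team": "Payments", "priority": "Normal"},
    {"team": "Payments", "priority": "High"},
    {"team": "Search", "priority": "Normal"},
    {"priority": "Highest"},  # an issue that was never assigned to a team
]

by_team = group_counts(issues, "team")
by_priority = group_counts(issues, "priority")
print(by_team.most_common(1), by_priority)
```

The group with the largest count is usually the first place to drill into.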

Once you have identified where to focus at a high level, the next step is to find the pattern.

Click on the identified team/priority/group. You’ll see all the related issues.

From this view, you’ll need to check several pieces of metadata and try to spot the pattern. In this image, it seems like a lot of issues get stuck in the QA status (teal column) on the Jira board.
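A pattern like issues getting stuck in one status can also be checked numerically by summing the time issues spend in each status. The sketch below assumes you can export, for each issue, a chronological list of (status, entered at) pairs plus the time the issue was completed; that input shape, the status names, and the helper names are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def time_in_status(transitions, completed_at):
    """Return days spent in each status for one issue.

    transitions: chronological list of (status, entered_at) pairs.
    completed_at: when the issue left its last status.
    """
    totals = defaultdict(float)
    following = transitions[1:] + [(None, completed_at)]
    for (status, entered_at), (_, left_at) in zip(transitions, following):
        totals[status] += (left_at - entered_at).total_seconds() / 86400
    return dict(totals)


def slowest_status(issues):
    """Return the status with the most accumulated days, or None if empty."""
    totals = defaultdict(float)
    for transitions, completed_at in issues:
        for status, days in time_in_status(transitions, completed_at).items():
            totals[status] += days
    return max(totals, key=totals.get) if totals else None


# Hypothetical status histories for two issues.
issues = [
    ([("In Progress", datetime(2024, 1, 1)), ("QA", datetime(2024, 1, 2))],
     datetime(2024, 1, 6)),
    ([("In Progress", datetime(2024, 1, 3)), ("QA", datetime(2024, 1, 5))],
     datetime(2024, 1, 8)),
]
print(slowest_status(issues))
```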

Now that we have identified where to improve, we can move on to the next step.

3. Brainstorm on bets

Once you understand what the problem is, the next step is to brainstorm potential fixes for it.

Always look for the root cause of the problem. Use the following template to find it:

What is the root cause?
What is the customer impact?
What action can we take to prevent this from happening again?

4. Implement the bet

Execute the idea that we believe will fix the problem.

5. Check metrics to see whether the bet succeeded

Once we have implemented the fix, we should:

Check whether the metrics show an improvement in either Bug Resolution Time or Bugs Created.
Check whether the same issue happens again.

To improve, we need to ensure that no bug occurs twice.
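Both checks can be sketched in a few lines. The weekly numbers and root-cause labels below are invented for illustration, `bet_improved` and `recurring_root_causes` are hypothetical helpers, and comparing simple averages ignores noise and seasonality, so treat the result as a hint rather than proof.

```python
from collections import Counter
from statistics import mean


def bet_improved(before, after):
    """Lower is better for both Bug Resolution Time and Bugs Created."""
    return mean(after) < mean(before)


def recurring_root_causes(root_causes):
    """Return root-cause labels that appear on more than one bug."""
    counts = Counter(root_causes)
    return sorted(cause for cause, seen in counts.items() if seen > 1)


# Hypothetical weekly Bugs Created before and after the bet shipped.
bugs_created_before = [12, 15, 11, 14]
bugs_created_after = [9, 8, 10, 7]

# Hypothetical root-cause labels recorded on bugs created after the bet.
labels = ["missing-validation", "timezone", "missing-validation", "race-condition"]

print(bet_improved(bugs_created_before, bugs_created_after))
print(recurring_root_causes(labels))
```

A label that shows up again after the bet shipped suggests the preventive action did not address the root cause.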

6. Repeat

If we repeat this cycle a few dozen times, typically over 2-6 weeks, we’ll see drastic improvements in our engineering quality.

