How to Improve Quality?

What to focus on?

Product engineers’ main purpose is to deliver value to customers in the form of features (code). Quality metrics should reflect this purpose.

There are 3 types of quality issues

  1. Incidents: When a user cannot use the system.

  2. Production Bugs: When a user fails to use a feature.

  3. Staging & Development Bugs: Errors or issues identified pre-production. They have no impact on the user.

There is a major distinction between the top 2 layers of the pyramid and the bottom layer.

The top 2 layers reflect “user pain”. The bottom layer reflects “development pain”. As a business, we should mainly care about users, and engineering teams should align with business goals.

Product Engineering teams should track customer-facing bugs, AKA:

  • Incidents

  • Production Bugs

Incidents & Production Bugs

Staging & Development Bugs

  • reflects “development pain”

  • used for optimization efforts (like implementing CI/CD, TDD, PR Review processes etc.)

  • should NOT be reported

How to use metrics for non-customer facing bugs

Many bugs are created from:

  • Logs (Sentry etc.)

  • Staging testing

  • Code Review

  • Unit/integration/E2E tests

  • QA

The bugs generated from these channels might be as important as customer-facing ones. The goal, however, is to ensure that customers face fewer bugs. To accomplish this goal, we should ensure bugs from staging & development do not reach customers.

Just as a product team tracks a main product metric like DAU but also tracks every single button click, product engineering teams should track customer-facing bugs as their main metric and track non-customer-facing bugs like button clicks.

A product manager doesn’t look at button click analytics regularly. The same should apply to non-customer-facing bugs. We should only look at them when we think our internal processes are broken and need improvement.

What metrics to track?

When using metrics, we should use dual metrics that balance efficiency and effectiveness.

Bug Resolution Time: How fast do we solve those bugs for the users?

Bugs Created: How many bugs do our users encounter?

If we don’t track dual metrics, we’ll end up with weird scenarios like “We resolve a bug in 1 day on average, but have 1000 bugs created per month.” Dual metrics help us avoid scenarios like these.

In this case

  • Efficiency is Bug Resolution Time

  • Effectiveness is Bugs Created
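
As a rough illustration of the dual metrics, the sketch below computes Bugs Created and average Bug Resolution Time per ISO week. The `BUGS` sample data and the `weekly_dual_metrics` helper are hypothetical names invented for this example. Attributing resolution time to the week a bug was created is an assumption; your dashboard may bucket by resolution date instead.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical sample data: (created_at, resolved_at) for customer-facing bugs.
# resolved_at is None while the bug is still open.
BUGS = [
    (datetime(2024, 1, 1, 9), datetime(2024, 1, 2, 9)),
    (datetime(2024, 1, 3, 9), datetime(2024, 1, 6, 9)),
    (datetime(2024, 1, 8, 9), None),
]


def weekly_dual_metrics(bugs):
    """Return {(iso_year, iso_week): (bugs_created, avg_resolution_days)}."""
    created = defaultdict(int)
    lead_times = defaultdict(list)
    for created_at, resolved_at in bugs:
        iso = created_at.isocalendar()
        week = (iso[0], iso[1])
        created[week] += 1  # effectiveness: Bugs Created
        if resolved_at is not None:
            # efficiency: time from creation to completion, in days
            days = (resolved_at - created_at).total_seconds() / 86400
            lead_times[week].append(days)
    return {
        week: (count, mean(lead_times[week]) if lead_times[week] else None)
        for week, count in created.items()
    }


for week, (count, avg_days) in sorted(weekly_dual_metrics(BUGS).items()):
    print(week, "created:", count, "avg resolution days:", avg_days)
```

Reading both numbers side by side for the same week is the point: a low resolution time only looks good when the created count is also low.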

Advanced tips:

  • Use a week or a month as the timeframe when tracking these metrics

  • If you are constantly hiring, you may want to normalize the effectiveness metric as Bugs Created per Member

  • The definition of a bug & incident should be clear. You can find more info in the “How should bugs be tracked?” section

  • Use Issue Lead Time (time from issue creation to issue completion) when calculating Bug Resolution Time.

  • Some organizations use Bugs Resolved rather than Bugs Created. We suggest keeping it as “Bugs Created” due to

How should bugs be tracked?

Incidents & Production Bugs can be tracked in multiple ways. I’ll go over the most common ways major organizations track them.

Note: Each organization has different needs and requirements. Feel free to edit the options so they fit your needs.

Note 2: The options below assume Jira. It’s possible to do something similar in any issue management platform.

Most teams track bugs incorrectly.

Tracking bugs correctly typically requires process changes.

Option #1: For Small Organizations

  1. Each issue with type: bug is considered a quality fault.

  2. Incidents are tracked as priority: highest bugs.

  3. Production bugs are tracked as priority: normal/high bugs.

  4. Staging & Development bugs are tracked as priority: low bugs.

Benefits

  • It’s simple

  • Better for small organizations

Cons

  • Requires alignment & education across the whole team
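
The Option #1 rules can be sketched as a small mapping from issue type and priority to a quality category. The function name `classify_option_1` and the category labels are hypothetical, and the priority names assume a Jira-like scheme (Highest, High, Normal, Low); adjust them to your own configuration.

```python
def classify_option_1(issue_type, priority):
    """Map an issue to a quality category using the Option #1 rules."""
    if issue_type.strip().lower() != "bug":
        return None  # not a quality fault under Option #1
    priority = priority.strip().lower()
    if priority == "highest":
        return "incident"
    if priority in ("high", "normal"):
        return "production_bug"
    if priority == "low":
        return "staging_development_bug"
    # A custom priority that the four rules above do not cover.
    return "unclassified"


print(classify_option_1("Bug", "Highest"))
print(classify_option_1("Bug", "Low"))
print(classify_option_1("Story", "High"))
```

Because the category depends entirely on how people set the priority field, this only works when everyone applies the same definitions, which is the alignment cost listed above.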

Option #2: For Big Organizations

  1. All incidents SHOULD be created by your incident management tooling (PagerDuty, Opsgenie, etc.), which creates an issue with type: Incident

  2. All customer issues SHOULD be created by your support suite tooling (Zendesk, Intercom, etc.). The customer success team creates issues with type: Customer Bugs

  3. Any internally caught bugs should be created as type: bug

Benefits

  • The system enforces correct tagging, which removes the need for alignment & education

  • Better for large organizations

Cons

  • It takes change management across DevOps, Product, Engineering, and CS teams
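
With Option #2, the issue type alone tells you whether an issue is customer-facing, so the reporting filter becomes trivial. The sketch below assumes the type names from the list above and uses hypothetical issue keys; `is_customer_facing` is an illustrative helper, not part of any tool.

```python
# Assumed issue type names, taken from Option #2; adjust to your own setup.
CUSTOMER_FACING_TYPES = {"incident", "customer bugs"}


def is_customer_facing(issue_type):
    """True for issue types that represent user pain under Option #2."""
    return issue_type.strip().lower() in CUSTOMER_FACING_TYPES


# Hypothetical issues as (key, issue_type) pairs.
issues = [("OPS-1", "Incident"), ("CS-7", "Customer Bugs"), ("ENG-3", "bug")]

# Only customer-facing issues feed the main quality metrics.
reported = [key for key, issue_type in issues if is_customer_facing(issue_type)]
print(reported)
```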

Tactical advice on improving quality

Improving anything follows the same process:

  1. Check metrics

  2. Identify where to improve

  3. Brainstorm on bets

  4. Implement the bet

  5. Check metrics to see if the bet succeeded

  6. Repeat

Any action we take should either:

  1. Decrease time it takes us to fix a bug (Bug Resolution Time)

  2. Decrease how many bugs we release to customers (Bugs Created)

1. Check Metrics

You should have a board like the following, where you track Quality as part of your operational metrics.

2. Identify where to improve

At this point you have 2 options

Debug a single issue

If you click any graph you’ll see all the issues in that week.

Debug high level patterns

Click on the graph and group by the field whose patterns you’d like to understand.

The most common group-by options are:

  1. Team

  2. Priority
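
If you export the issues instead of using a dashboard, the same group-by can be approximated in a few lines. This is a minimal sketch: the `ISSUES` records, their field names, and the `count_by` helper are hypothetical.

```python
from collections import Counter

# Hypothetical exported issues; field names are assumptions for this example.
ISSUES = [
    {"key": "BUG-1", "team": "Payments", "priority": "High"},
    {"key": "BUG-2", "team": "Payments", "priority": "Normal"},
    {"key": "BUG-3", "team": "Search", "priority": "High"},
    {"key": "BUG-4", "team": "Payments", "priority": "High"},
]


def count_by(issues, field):
    """Count issues per value of `field`, largest group first."""
    return Counter(issue[field] for issue in issues).most_common()


print(count_by(ISSUES, "team"))
print(count_by(ISSUES, "priority"))
```

The largest group is usually the best place to start looking for a pattern.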

Once you identify where to focus at a high level, the next step is to find the pattern.

Click on the identified team/priority/group. You’ll see all the related issues.

From this view, you’ll need to check several metadata fields and look for a pattern. In this image, it seems like a lot of issues get stuck in the QA status (teal column) on the Jira board.
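
One way to confirm a “stuck in a status” pattern is to sum the time issues spend in each status. The sketch below assumes you can export each issue’s status transitions; the `HISTORY` data and the `days_per_status` helper are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical status history per issue: (status, entered_at) pairs in
# chronological order, ending with the final status.
HISTORY = {
    "BUG-1": [
        ("To Do", datetime(2024, 1, 1)),
        ("In Progress", datetime(2024, 1, 2)),
        ("QA", datetime(2024, 1, 3)),
        ("Done", datetime(2024, 1, 8)),
    ],
    "BUG-2": [
        ("To Do", datetime(2024, 1, 2)),
        ("In Progress", datetime(2024, 1, 3)),
        ("QA", datetime(2024, 1, 4)),
        ("Done", datetime(2024, 1, 10)),
    ],
}


def days_per_status(history):
    """Sum the days spent in each status across all issues."""
    totals = defaultdict(float)
    for transitions in history.values():
        # Pair each status with the next one to get its exit time.
        for (status, entered), (_, left) in zip(transitions, transitions[1:]):
            totals[status] += (left - entered).total_seconds() / 86400
    return dict(totals)


totals = days_per_status(HISTORY)
slowest = max(totals, key=totals.get)
print(totals, "slowest status:", slowest)
```

The final status of each issue has no exit time, so it contributes nothing to the totals.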

Now that we have identified where to improve, we can move to the next step.

3. Brainstorm on bets

Once you understand what the problem is, the next step is to brainstorm potential fixes for it.

Always look for the root cause of the problem. Use the following template to find it:

  1. What is the root cause?

  2. What is the customer impact?

  3. What action can we take to prevent this from happening again?

4. Implement the bet

Execute the idea (the bet) that we expect to fix the problem.

5. Check metrics to see if the bet succeeded

Once we have implemented the fix, we should

  1. Check the metrics to see whether either Bug Resolution Time or Bugs Created improved

  2. Check if we have the same issue happening again.

To improve, we need to ensure no bug occurs twice.
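
A simple way to check a bet is to compare the metric’s average over a few weeks before and after the change. This is a rough sketch with made-up numbers and a hypothetical `relative_change` helper. It is a plain comparison of averages, not a statistical test, so treat small differences with caution.

```python
from statistics import mean


def relative_change(before, after):
    """Relative change in the average of a lower-is-better metric.

    A negative result means the metric improved after the bet.
    """
    baseline = mean(before)
    return (mean(after) - baseline) / baseline


# Hypothetical weekly Bugs Created, before and after implementing the bet.
bugs_created_before = [12, 15, 11]
bugs_created_after = [9, 8, 10]

change = relative_change(bugs_created_before, bugs_created_after)
print(f"Bugs Created changed by {change:.0%}")
```

The same function works for Bug Resolution Time, since lower is better for both metrics.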

6. Repeat

If we repeat this cycle a few dozen times, typically over 2-6 weeks, we’ll see drastic improvements across our engineering quality.
