How to Know if TDD is Working

How will you know if Test-Driven Development (TDD) is working for your teams, program, or organization?

I've noticed that small, independent teams typically don't ask this.  They are so close to the end-points of their value-stream that they can sense whether a new discipline is helping or hindering.

But on larger programs with multiple teams, or a big "roll-out" or "push" for quality practices, leaders want to know whether or not they're getting a return on investment.

Sometimes management will ask me, point-blank: "How long before I recoup the cost of your TDD training?"

There are a lot of variables, of course; and knowing when you've reached break-even is going to depend on what you were doing, and for how long, before you started a TDD discipline. Also, you're not going to be able to measure the change in a metric you're not already measuring. So I suggest teams start measuring what matters, now.

Nevertheless, you may be able to tell simply by the morale on the teams. In my experience, there's always a direct correlation between happy employees and happy customers. Also, a direct correlation between happy customers and happy stakeholders.  That's the triple-win:  What's truly good for customers and employees is good for stakeholders.*


A graph sent (unsolicited) to me from one very pleased client. (Yeah, it'd be great if they had added a "value" line. Did I mention unsolicited?) There's the obvious benefit of fewer defects, but also note that bugs-found is no longer oscillating at release boundaries. Oscillation is what a system does before tearing itself apart.

Metrics I Often Recommend

Here are some metrics I recommend to teams.  I'm not suggesting you must track all of these.  (A small sketch of how they might be computed follows the list.)

  • Average lead time for defect repair: Measure the time between defect-found and defect-fixed, by collecting the dates of these events. Graph the average over time.

  • Average cycle time for defect repair: Measure the time between decide-to-fix-defect and defect-fixed, by collecting the dates of these events. Graph the average over time.

  • A simple count of unfixed, truly high-priority defects. Show-stoppers and criticals, that sort of thing. Graph the count over time.
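
To make these concrete, here's a minimal sketch, in Python, of how a team might compute these three metrics from a simple list of defect records.  The field names and priority labels are assumptions for illustration, not a prescribed schema.

    from dataclasses import dataclass
    from datetime import date
    from statistics import mean
    from typing import Optional

    @dataclass
    class Defect:
        found: date                       # defect-found
        decided: Optional[date] = None    # decide-to-fix-defect
        fixed: Optional[date] = None      # defect-fixed
        priority: str = "normal"          # e.g., "show-stopper", "critical"

    def avg_lead_time(defects):
        """Average days between defect-found and defect-fixed."""
        days = [(d.fixed - d.found).days for d in defects if d.fixed]
        return mean(days) if days else None

    def avg_cycle_time(defects):
        """Average days between decide-to-fix and defect-fixed."""
        days = [(d.fixed - d.decided).days
                for d in defects if d.fixed and d.decided]
        return mean(days) if days else None

    def open_high_priority_count(defects):
        """Unfixed show-stoppers and criticals."""
        return sum(1 for d in defects
                   if d.fixed is None and d.priority in ("show-stopper", "critical"))

Graph each of these over time (per week or per iteration) rather than reacting to any single data point.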

Eventually, other quality metrics become useful.  Once a team is doing well, Mean Time Between Failures (MTBF), which assumes a very short (near-zero) defect lead time, can be added to the mix.

On one high-performing Agile team I worked on, with my friend James Shore in 2001, we eventually focused on one metric:  "Age of Oldest Defect."  It really got us to dig into one old, ornery, hard-to-reproduce defect. The reason it had eluded us for so long was, in part, that it had a ridiculously simple work-around: Customers could take a deep breath and re-submit their requests, and so they rarely reported it.

Whenever it occurred, we would get a stack-trace from our nifty self-reporting Exceptions subclass (which did not just bury the stack trace in a log file, but would e-mail it to the dev team). We knew it was caused by a rare database connection timeout, but we had never before been able to ascertain why a connection in our connection pool would ever go stale. “Inconceivable!”
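
I no longer have that team's code, but here's a rough sketch of the idea in Python (the address and SMTP setup are made up, and the original was written in another language): an exception that e-mails its own stack trace to the dev team when it is constructed, typically at the point it is raised, instead of merely burying it in a log file.

    import smtplib
    import traceback
    from email.message import EmailMessage

    DEV_TEAM = "dev-team@example.com"   # hypothetical address

    class SelfReportingError(Exception):
        """An exception that e-mails its own stack trace to the dev team."""

        def __init__(self, message):
            super().__init__(message)
            self._report(message)

        def _report(self, message):
            msg = EmailMessage()
            msg["Subject"] = f"SelfReportingError: {message}"
            msg["From"] = DEV_TEAM
            msg["To"] = DEV_TEAM
            msg.set_content("".join(traceback.format_stack()))
            try:
                with smtplib.SMTP("localhost") as smtp:   # assumed local mail relay
                    smtp.send_message(msg)
            except OSError:
                pass  # reporting must never take the application down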

This defect was a great illustration of a general rule of bug-fixing:  Most bugs are easy to fix once you know why they’re happening…but that’s the hard part!

Agile teams may also want to keep an eye on this one:  Average cycle and/or lead times for User Stories, or “Minimal Likeable Features.” On the surface, this sounds like a performance metric.  I suppose if the work-items reliably arrive in a most-important-thing-first order, then it's a reasonable proxy for "performance."  But its real purpose is to help diagnose and resolve systemic (i.e., "process") issues.

What’s truly important about measuring these:

  1. Start measuring as soon as possible, preferably gaining some idea of what things look like before making broad changes, e.g., before your Essential Test-Driven Development training begins.

  2. The data should be collected as easily as possible: Automatically, or by an unobtrusive, non-managerial, third party. Burdening the team with a lot of measurement overhead is often counterproductive: The measurement data suffers; productivity suffers; morale suffers.

  3. The metrics must be used as informational and not motivational: They should be available to the team, first and foremost, so the team can watch for trends. Metrics must never be used to reward or punish the team, or to pit teams within the same organization against each other.**

If you want (or already have) highly-competitive teams, then consider estimating Cost of Delay and CoD/Duration (aka CD3, estimated by all involved "levels" and "functions"), customer conversions, customer satisfaction, and other Lean Startup metrics; and have your organization compete against your actual competitors.  
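
In case CD3 is unfamiliar: it is simply Cost of Delay divided by duration, used to sequence work so that the most expensive-to-delay, quickest-to-deliver items go first.  A tiny illustration, with invented numbers:

    # CD3 = Cost of Delay / Duration; schedule the highest CD3 first.
    # All numbers are made up for illustration.
    features = [
        {"name": "A", "cod_per_week": 30_000, "duration_weeks": 6},
        {"name": "B", "cod_per_week": 10_000, "duration_weeks": 1},
    ]
    for f in sorted(features,
                    key=lambda f: f["cod_per_week"] / f["duration_weeks"],
                    reverse=True):
        print(f'{f["name"]}: CD3 = {f["cod_per_week"] / f["duration_weeks"]:,.0f}')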

Metrics I Rarely Recommend

Velocity:

Estimation of story points (SPs) and the use of velocity (SPs/timebox) may be necessary, for a brief time, on a team where the User Stories vary considerably in size.  Velocity is a simple planning tool that gives the team an idea of whether the scope they have outlined in the release plan will be completed by the release date.

When done correctly, story points and velocity give information similar to cycle time, just inverted.

To illustrate this:  Often Scrum teams who stop using sprints and release plans in favor of continuous flow will switch from story points per sprint to average cycle time per story point. Then, if the variation in User Story “size” (effort, really) diminishes, they can drop points entirely and measure average cycle time per story.
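
As a rough illustration (hypothetical numbers, and ignoring work-in-progress effects), here is how that flow-style counterpart might be computed from a few completed stories:

    from datetime import date
    from statistics import mean

    # (started, finished, story points) for a few completed stories -- made-up data
    completed = [
        (date(2024, 3, 1), date(2024, 3, 4), 3),
        (date(2024, 3, 2), date(2024, 3, 3), 1),
        (date(2024, 3, 5), date(2024, 3, 12), 5),
    ]

    days_per_point = mean((finished - started).days / points
                          for started, finished, points in completed)
    print(f"Average cycle time: {days_per_point:.1f} days per story point")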

The problem with using velocity as a metric to track improvements (e.g., the use of TDD) is this:  As things improve, story-point estimates (again, it’s effort, not time) may actually drop for similar stories.  You should expect velocity to stabilize, not increase, over time.  Velocity is for planning; it's a very poor proxy for productivity.

Code coverage:

You could measure code coverage (how much of the code is exercised via tests, particularly unit tests) and watch the trends, similar to the graph above (that team graphed number-of-tests).  This is fine, again, if it's used as an informational metric and not a motivational one.  Keep in mind that it's easy for an informational metric to be perceived as motivational, which makes it motivational.  The trouble with code coverage is that it is too much in the hands of those who feel motivated to improve it, and they may subconsciously "game" the metric.

About 15 years ago, I was working with a team who had been given the task of increasing their coverage by 10% each iteration.  When I got there, they were at 80%, and very pleased with themselves.  But as I looked at the tests, I saw a pattern:  No assertions (aka expectations)!  In other words, the tests literally exercised the code but didn't test anything.  When I asked the developers, they looked me in the eyes, straight-faced, and said, "Well, if the code doesn't throw an exception, it's working."
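
For the curious, here's roughly what that pattern looks like (a made-up Python/pytest example, not their actual code): the first test bumps coverage but can only fail if an exception is thrown; the second actually asserts on the behavior.

    def apply_discount(price, percent):
        return price - price * percent / 100

    def test_apply_discount_no_assertions():
        # Exercises the code (and counts toward coverage), but it can only
        # fail if an exception is thrown -- it verifies nothing.
        apply_discount(100.0, 10)

    def test_apply_discount_for_real():
        # Actually checks the behavior.
        assert apply_discount(100.0, 10) == 90.0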

Of course, these junior developers soon understood otherwise, and many went on to do great things in their careers. But they really did think, at the time, they were correctly doing what was required. That old truism is…well…true:  You get what you measure!

The metrics that I do recommend are more difficult to "game" by an individual working alone.  Cycle-times are a team metric.  Yes, it's possible a team could conspire to game those metrics, but they would have to do so consciously, and nefariously.  If you don't—or can't—trust your team to behave as professionals, no metric or engineering practice is going to help anyway.  You will simply fail to produce anything of value.


* But don’t just take my word for it. My old friend, Rich Sheridan, did a thing: Joy, Inc.

** Robert Austin, Measuring and Managing Performance in Organizations
