September 20, 2016

Oscillations make our systems difficult to maintain

One of the most difficult things about designing systems (both in the technical and human sense) is figuring out how to keep oscillations under control.

You can see the impact of oscillation in every aspect of your work:
  • In businesses, cash flows encourage greater risk taking and speculation when money is at either extreme (i.e. massive reserves or none at all), and discourage risk taking close to the break-even point. But past decisions influence future outcomes, and so you may pay for yesterday's risky behavior tomorrow, or look back at yesterday and wish that you hadn't passed on some opportunity that is obvious in retrospect.
  • In application development, when a system is in its early functional stages, it is easy to add lots of features. Even if the defect rate is somewhat high, bugs can be fixed quickly because there is enough time to go around, and the ratio of improvement time to repair time is still strongly tilted towards the productive side. But as hard-to-fix and latent bugs accumulate, productive output tanks. This sometimes inspires a "big cleanup" which drives throughput back up, but if the same patterns hold, the cycle will repeat itself over and over again.
  • In product design, a product that has become too unwieldy or is no longer attractive to its audience sees reduced active use day-to-day and fewer word-of-mouth referrals (or worse, negative reviews). This can be a bad thing from a business perspective, but at the same time, it can also lift the pressures that come along with being "in demand" or "trending." This gives product designers an opportunity to go back to the drawing board and find some new idea that might be a big hit, but if they aren't careful to come up with some sort of managed plan for growth, then the same thing will happen all over again.
The problem with oscillations is that we can't recognize their patterns until the upswing or downswing is already underway. That means that if you're waiting to see a change in sign, or a set change in magnitude, before you react, you're almost certainly going to be caught ahead of or behind the wave.

My rough thoughts on the solution to the oscillation problem are that you need to take a four-pronged approach:

1) Bake in constraints and circuit breakers that limit the extreme behaviors of any essential part of a system. This won't stop an oscillation from occurring, but it'll at least put a limit on how far it can go before the system gets too far out of balance.

2) Measure and track failures as well as near-misses at the edges of a system. Communicate about them, analyze them, and redesign constraints to tune the system so that it can do a better job of staying within an acceptable, stable operating range.

3) Design and implement a clear, easy way to "stop the line" when a system gets too far out of balance. No one likes a temporary shutdown, but trying to fit a wrench into an overheating motor isn't a wise decision, either.

4) Remember that systems (in theory) are designed by humans to achieve a certain goal. If a particular system isn't meeting that goal well, or if it manages to meet its goal only by way of a continuous rollercoaster ride that is putting strains on the people and processes maintaining it, then the difficult but honest question of "should we get rid of this system entirely and go back to the drawing board?" does need to come up.
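
To make the first three prongs a little more concrete, here is a rough sketch in Python of what they might look like around a single essential quantity in a system. Everything in it is an assumption made up for illustration: the `GuardedValue` name, the `margin` that defines a near-miss, and the `max_breaches` threshold that stops the line are not taken from any real library or system. Treat it as a starting point for discussion, not a recommended implementation.

```python
from dataclasses import dataclass, field


@dataclass
class GuardedValue:
    """Hypothetical guard around one essential quantity of a system."""

    low: float         # hard lower limit (prong 1: constraint)
    high: float        # hard upper limit (prong 1: constraint)
    margin: float      # distance from a limit that counts as a near-miss
    max_breaches: int  # breaches tolerated before the line is stopped
    near_misses: list = field(default_factory=list)  # prong 2: tracked
    breaches: list = field(default_factory=list)     # prong 2: tracked
    stopped: bool = False                            # prong 3: stop the line

    def apply(self, requested: float) -> float:
        """Return the value the system is actually allowed to use."""
        if self.stopped:
            raise RuntimeError("line is stopped; review and call reset()")

        # Prong 1: never let the value leave the allowed range.
        clamped = min(max(requested, self.low), self.high)

        if clamped != requested:
            # Prong 2: record the failure (the request was out of range).
            self.breaches.append(requested)
            # Prong 3: too many failures means we stop accepting requests.
            if len(self.breaches) >= self.max_breaches:
                self.stopped = True
        elif (requested - self.low < self.margin
              or self.high - requested < self.margin):
            # Prong 2: in range, but uncomfortably close to an edge.
            self.near_misses.append(requested)

        return clamped

    def reset(self) -> None:
        """Restart after a human has reviewed the recorded breaches."""
        self.breaches.clear()
        self.stopped = False
```

The recorded `near_misses` and `breaches` are the raw material for the "communicate, analyze, redesign" part of the second prong: if they keep piling up near one edge, that is a hint the limits or the system itself need rethinking. The fourth prong is deliberately absent here, since deciding whether a system should exist at all is not something code can answer.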

We should have a deeper conversation about this. It's a topic that really matters to everything we do in this field. I'm still gathering my thoughts, but would love to hear yours. You can reach me by replying directly to this email. :-)