problems with Marketing and Engineering (was: Documenting Betas)

Subject: problems with Marketing and Engineering (was: Documenting Betas)
From: John Wilcox <john -at- SYNTAX -dot- COM>
Date: Tue, 14 Nov 1995 13:11:25 -0800

> Date: Tue, 14 Nov 1995 12:18:57 -0500
> From: Esther Wheeler <ewheeler -at- AZURE-TECH -dot- COM>
> Subject: Documenting Betas and Finals

> I have (in the past) worked on products which I feel have glaring
> user interface problems. However, if engineering thinks it "gets
> the job done" they may want to release anyway. I try to get in
> early and make a lot of user-friendly noise to cut down on this.

I have been blessed to be in a position where I am actually required to
provide feedback on engineering designs. And I, too, find such feedback
helps. Comments such as, "Do you actually think the user will
understand this?" Or, "This is the stupidest thing I've ever seen."
(I confess that tact is one of my lesser virtues.)

> Sometimes you can find an ally, like in marketing.

An ally in marketing. Hmmm. I have found that most marketing types are
at least as far out in left field as engineers. (I'm a Dilbert fan.)
Glad you have found an exception.

> But I'm always looking for techniques that work -- or even funny stories.

I sent the following to our engineering and marketing departments in
May 1994. It's long (about 400 lines), but worth it. And it is the
source of one of my favorite sayings: Marketing -- where the rubber
meets the sky.

-----------------------------------------------------------------------
This claims to be an internal Silicon Graphics memo. How it managed to
escape from SGI, I don't know. (I got it from someone at Microsoft,
where it was making the rounds.) It describes some of the problems
encountered and lessons learned during recent software development at
SGI. The author offers some valuable insight, from which I think any
company could benefit. The author describes problems and possible
solutions in the engineering and marketing departments, but they
could be adapted to others as well. DISCLAIMER: This is not
intended as an indictment of anything or anyone at Syntax. Consider it
an input to the TotalQuality program. Some parts of this will not
necessarily be applicable, depending on the reader, so just "Eat the
meat and throw away the bones."

Software Usability II
October 5, 1993
Tom Davis

SUMMARY

Release 5.1 is a disappointment. Performance for common operations has
dropped 40% from 4.0.5, we shipped with 500 priority 1 and 2 bugs, and
a base Indy is much more sluggish than a Macintosh. Disk space
requirements have increased dramatically.

The primary cause is that we attempted far too much in too little time.
Management would not cut features early, so we were forced to make
massive cuts in the final weeks of the release.

What shall we do now? Let's not look for scapegoats, but learn from our
mistakes and do better next time.

A December release of 5.1.2 is too early to fix much -- we'll spend much
more time on the release process than fixing things. Allow enough time
for a solid release so we don't get: 5.1.2.1, 5.1.2.2, 5.1.2.3, ...

Let's decide ahead of time exactly what features are in 5.1.2. If we
pick a reasonable set we'll avoid emergency feature cuts at the end.

Nobody knows what's wrong -- opinions are as common as senior engineers.
The software environment is so convoluted that at times it seems to
rival the US economy for complexity and unpredictability. I propose
massive code walk-throughs and design reviews to analyze the software.
We'll be forced to look closely at the code, and fresh reviewers can
provide fresh insights.

For the long term, let's change the way we do things so that the
contents and scheduling of releases are better planned and executed.
Make sure marketing and engineering expectations are in agreement.

BLOAT UPDATE

"Do you want to be a bloat detective? It's easy; just pick any
executable. There! You found some!" -- Rolf van Widenfelt

In the May report, I listed a bunch of executable sizes, and pointed out
that they were unacceptable if we intended to run without serious paging
problems on a 16 MB system. Between May and the 5.1 release, many have
grown even larger. IRIX went up from 4.8 MB to 8.1 MB, and has a memory
leak that causes it to grow. Within a week, my newly-booted 5.1 IRIX
was larger than 13.8 MB -- a big chunk of a 16 MB system. It's wrong to
require our users to reboot every week.

There are too many daemons. In a vanilla 5.1 installation with Toto,
there are 37 background processes.

DSOs were supposed to reduce physical memory usage, but have had just
the opposite effect, and their indirection has reduced performance.

Programs like Roger Chickering's "Bloatview," based on Wiltse Carpenter's
work, make some problems obvious. The news reader "xrn" starts out
small, but leaks memory so badly that within a week or so it grows to 9
or 10 MB, along with plenty of other large programs. But what's really
embarrassing is that even the kernel leaks memory that can't be
recovered except by rebooting!

Showcase grew from 3.2 MB to 4.0 MB, and the master and status gizmos
which are run by default occupy another 1.7 MB. Much of this happened
simply by recompiling under 5.1 -- not because of additional code.

The window system (Xsgi + 4Dwm) is up from 3.2 MB to 3.6 MB, and the
miscellaneous stuff has grown as well. As I type now, I have the
default non-Toto environment plus a single shell and a single text
editor, jot. The total physical memory usage is 21.9 MB, and only
because I rebooted IRIX yesterday evening to reduce the kernel size.
Luckily, I'm on a 32 MB system without Toto, or I'd be swamped by
paging.

Much of the problem seems to be due to DSOs that load whole libraries
instead of individual routines. Many SGI applications link with 20 or
so large DSOs, virtually guaranteeing enormous executables.

In spite of the DSOs, large chunks of Motif programs remain unshared,
and duplicated in all Motif applications.

PERFORMANCE UPDATE

"Indy: an Indigo without the 'go'." -- Mark Hughes

"X and Motif are the reasons that UNIX deserves to die." -- Larry Kaplan

The performance story is just as bad. I was tempted to write simply,
"Try to do some real work on a 16 MB Indy. Case closed.", but I'll
include some details.

In May, I listed some unacceptable Motif performance measurements. Just
before 5.1 MR, someone reran my tests and discovered that the
performance had gotten even worse. Some effort was expended to tune the
software so that instead of being intolerable, it was back to merely
unacceptable performance.

We no longer report benchmark results on our standard system. The
benchmarks are not done with the DSO libraries; they are all compiled
non-DSO so that the reported performance in 5.1 does not appear to have
declined too much.

What's most frightening about the 5.1 performance is that nobody knows
exactly where it went. If you start asking around, you get plenty of
finger-pointing and theories, but few facts. In the May report, I
proposed a "5% theory," which states that each little thing we add
(Motif, internationalization, drag-and-drop, DSOs, multiple fonts, and
so on) costs roughly 5% of the machine. After 15 or 20 of these, most
of the performance is gone.

Bloating by itself causes problems. There's heavy paging, and there's so
much code, and it's so scattered, that the cache may as well not be there.
The window manager and X and Toto are so tangled that many minor
operations like moving the mouse or deleting a file wake up all the
processes on the machine, causing additional paging, and perhaps
graphics context swaps.

But bloat isn't the whole story. Rocky Rhodes recently ran a small
application on an Indy, and noticed that when he held the mouse button
down and slid it back and forth across the menu bar, the (small) pop-up
menus got as much as 25 seconds behind. He submitted a bug, which was
dismissed as paging due to lack of memory. But Rocky was running with
160 MB of memory, so there was no paging. The problem turned out to be
Motif code modified for the SGI look that is even more sluggish than
regular Motif. Perhaps the problem is simply due to the huge number of
context swaps necessary for all the daemons we're shipping.

The complexity of our system software has surpassed the ability of
average SGI programmers to understand it. And perhaps not just average
programmers. Get a room full of 10 of our best software people, and
you'll get 10 different opinions of what's causing the lousy performance
and bloat. What's wrong is that the software has simply become too
complicated for anyone to understand.

WHAT WENT WRONG IN 5.1?

The one sentence answer is: we bit off more than we could chew. As a
company, we still don't understand how difficult software is.

We planned to make major changes in everything -- a new operating
system, new compilers, a new user environment, new tools, and lots of
new features in the multimedia area. Not only that, but the new stuff
was promised to do everything the old software had done, and with major
enhancements. (Early warning: version 6.0 promises to be even more
disruptive.)

About 9 months ago, Rocky and I pointed out the impossibility of what we
were attempting. Rather than reduce the scope of the projects, a
decision was made to hire a couple of contractors (who know nothing
about our system) to handle the worst user interface problems in the
Roxy project. In addition, promises were obtained from various
executives that a significant effort would be made to improve software
performance.

Management was basically afraid to cut any features, so we continued to
work on a project that was far too large. The desperate attempt to do
everything caused programmers to cut corners, with disastrous effects on
the bug count. And the bug count was high simply because 5.1 was so
big.

Only when the situation was beyond hope of repair did we start to do
something. Features and entire products were removed wholesale from the
release, and hundreds of high-priority bugs were classified as
exceptions, so that we could ship with "no priority 1 and 2 bugs." We
did, however, ship with over 500 "exceptions." The release was deemed
too crummy to push to all our machines, but was restricted to the
Indys, the high-end machines, and a few others where new hardware
required the new software. Due to the massive bug count, virtually no
performance tuning was done.

When the schedule is impossible, as it was in 5.1, the release process
itself can get in the way. The schedule imposes a code freeze long
before the software is stable, and fixing things becomes much more
difficult. If you know you're going to be late, slip before the code
freeze, not after. We're trying to wrap up the box before the stuff
inside is finished, and then trying to fix things inside the box
without undoing the wrapping -- it has to be less efficient.

Management Issues:

There was never an overall software architect, and there still is not,
and until Way Ting was given the job near the end, there was no manager
in charge of the 5.1 release, either.

I wrote a note in sgi.bad-attitude about the "optimist effect," which I
believe is mostly true. In condensed form:

Optimists tend to be promoted, so the higher up in the organization you
are, the more optimistic you tend to be. If one manager says "I can do
that in 4 months," and another only promises it in 6 months, the 4 month
guy gets the job. When the software is 4 months late, the overall
system complexity makes it easy to assign blame elsewhere, so there's no
way to judge mismanagement when it's time for promotions.

To look good to their boss, most people tend to put a positive spin on
their reports. With many levels of management and increasing optimism
all the way up, the information reaching the VPs is very filtered, and
always filtered positively.

The problem is that the highly filtered estimates are completely out of
line with reality (at least in recent software plans here at SGI), and
there are no reality checks back from the VPs to the engineers on the
bottom. I think it's great to have aggressive schedules where you try
to get things out 20% or so faster than you'd expect. The problem is
that in 5.1, the engineers were expected to get things out 80% faster,
and it was clearly impossible, so many just gave up.

We certainly didn't win any morale prizes among the engineers with 5.1.
It's the first release here at SGI where most of the engineers I talked
to are ashamed of the product. There are always a few, but this time
there were many. When engineers were asked to come in over the weekends
before the 5.1 release to fix show-stopper bugs, I heard a comment like:
"Why bother? SGI's going to release it anyway, whether they're fixed or
not."

I'm not blaming the engineers. Most of them worked their hearts out for
5.1, and did the best they could, given the circumstances. They'll be
happy to buy into a plan where there's a 20% stretch, but not where
there's an 80% stretch. They figure: "It's hopeless, and I'll be late
anyway, and I'm not going to get rewarded for that, so why kill myself?"

Marketing-Engineering Disconnect

"Marketing -- where the rubber meets the sky." -- Unknown

There's a disconnect between engineering and marketing. It's not
surprising -- marketing wants all the whiz-bang features, it wants to
run in 16 MB, and it wants it yesterday. Although engineering would
like the same things, it is faced with the reality of time limits, fixed
costs, and the laws of nature.

It's great to have pressure from marketing to do a better job, but at
SGI, we often seem to have deadlocks that are simply not resolved.
Marketing insists that Indy will work in 16 MB and engineering insists
that it won't, but both continue to make their plans without resolving
the conflict, so today we're shipping virtually useless 16 MB systems.
Similarly for feature lists, reliability requirements, and deadlines.
Well, at least we met the deadline.

WHAT TO DO -- SHORT TERM (5.1.2)

"We should sell 'bloat credits', the way the government sells pollution
credits. Everybody's assigned a certain amount of bloat, and if they go
over, they have to purchase bloat credits from some other group that's
been more careful." -- Bent Hagemark

There are problems in both performance and bugs, and we'd like to fix
both. But the first thing we should do is decide exactly what's going
into release 5.1.2.

If we are serious about a December all-platforms release, there may be
very little we can do other than keep stumbling along as we have been.
Three months isn't much time to do anything, considering the overhead of
a release, where perhaps half of the time will be spent in "code
freeze." After 5.1, many engineers are exhausted, and it's unreasonable
to expect them to start hard work immediately. 500 outstanding priority
1 and 2 bugs is a huge list, and we haven't even begun to hear about
customer problems yet.

What Should be in Release 5.1.2:

I'm afraid the answer is going to be "everything that didn't make it
into 5.1." I know that won't be the case, but I hope that we will
carefully select what goes in now, rather than hack things out in a
panic in December. The default should be "not included," and we should
require a good reason to include things. Let's make sure that there's a
minimal, solid, working set before we start adding frills.

Improving Performance:

"SGI software has a cracked engine block, and we're trying to fix it
with a tune-up." -- Mark Segal

As stated above, we don't even know exactly what's wrong. We probably
never will, but we should start doing things that will have as much of
an impact on the problem as possible. I don't think we have time to
study the problem in detail and then decide what to do -- we've got to
mix the research with doing something about it.

Before we begin, we should have definite performance goals -- lose less
than 5% wall-clock time on compiles of some known program over 4.0.5,
have shells come up as fast as in 4.0.5, or whatever.

Some people claim that we need new software debugging tools to look at
the problem, and that may be true, but it's not a short-term solution,
and it runs the risk of causing us to spend all our time designing
performance measurement tools, rather than fixing performance.

In fact, I don't really believe that simple "tuning" will make a large
dent. To get things to run significantly faster, we've got to make
significant changes. And we can't beat the "5% rule" by just speeding
up all the systems by 5% -- if everything is exactly 5% faster, the
overall system will be exactly 5% faster.
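
The point about uniform speedups can be checked with a small Python
sketch. It assumes that the time for an operation is simply the sum of
the time spent in each subsystem. The subsystem names and timings below
are made up for illustration and are not measurements.

```python
# Made-up time (in seconds) that each subsystem contributes to one operation.
# The subsystem names and numbers are assumptions for illustration only.
times = {"kernel": 4.0, "window system": 3.0, "toolkit": 2.0, "daemons": 1.0}


def uniformly_faster(component_times, factor):
    """Return new times with every component sped up by the same factor."""
    return {name: t / factor for name, t in component_times.items()}


def without(component_times, name):
    """Return new times with one component removed entirely."""
    return {k: t for k, t in component_times.items() if k != name}


def overall_speedup(before, after):
    """Ratio of total time before to total time after."""
    return sum(before.values()) / sum(after.values())


if __name__ == "__main__":
    print(overall_speedup(times, uniformly_faster(times, 1.05)))
    print(overall_speedup(times, without(times, "daemons")))
```

Speeding up every component by 5% gives an overall speedup of 1.05 and
no more. With these made-up numbers, removing one small component
outright gives about 1.11, which is the case for simplification over
tuning.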

There's a strong tendency to look for the "quick fix." "Get the code
rearranger to work," or "Put all the non-modifiable strings in shared
code space," for example. These ideas are attractive, since they
promise to speed up all the code, and they should probably be pursued,
but I think we're not going to make a lot of progress until we identify
the major software architectural problems and do some massive
simplification. Remember that DSOs were the last "quick fix."

There's got to be more to it than tuning; there must be some amazingly
bad software architecture -- from a novice's point of view, a 4 MB
Macintosh runs a far more efficient, interesting system than a 16 MB
Indy. The Mark Segal quote above sums it up.

Code walk-throughs and design reviews are in order for most of our
software. The attendees should include not only people working in the
same area, but a small cross-section of experienced engineers from other
areas. Get a pool of, say, 20 experienced engineers and perhaps 3 at a
time would sit in on code reviews together with the other people working
in that area.

Code reviews will help in many ways -- the engineer presenting the code
will have to understand it thoroughly to present it, others will learn
about it, and outside observers will provide different ways to look at
the problems.

The most important thing should be the focus -- we're trying to make the
code better and faster, not to make it more general, or have new
features, or be more reusable, or better structured.

For complex problems, the walk-through should also include some general
design review. Are these daemons really necessary? Do we really need
this feature? And so on.

Fixing Bugs:

The code walk-throughs will obviously tend to turn up some bugs, so
they'll serve a dual purpose.

With 500 or so priority 1 and 2 bugs, we must prioritize these as well.
A bug that causes a system crash only on machines with some rare
hardware configuration is properly classified priority 1, but it's
probably less important than a bug in a popular program like Showcase
that causes you to lose your file every tenth time, which would normally
rank as priority 2. The effort involved in the fix should also be taken
into account. For bugs of equal frequency of occurrence, it's probably
better to fix 20 priority 2 bugs than 1 priority 1 bug if the priority 2
bugs are 20 times easier to fix.
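
One way to make that trade-off concrete is a rough score: severity
times how often users hit the bug, divided by the effort to fix it. The
Python sketch below is only an assumption about how such a ranking
might look, not an SGI process. The severity weights, field names, and
example numbers are all invented.

```python
from dataclasses import dataclass


@dataclass
class Bug:
    name: str
    priority: int          # 1 is most severe, as in the bug database
    hits_per_1000: float   # assumed: how many of 1000 users run into it
    fix_days: float        # assumed: engineer-days needed to fix it


# Assumed severity weights: a priority 1 bug counts twice a priority 2 bug.
SEVERITY = {1: 2.0, 2: 1.0}


def score(bug):
    """Harm removed per engineer-day; higher means fix sooner."""
    return SEVERITY[bug.priority] * bug.hits_per_1000 / bug.fix_days


def triage(bugs):
    """Return bugs sorted from most to least worth fixing."""
    return sorted(bugs, key=score, reverse=True)


bugs = [
    Bug("crash on rare hardware", priority=1, hits_per_1000=2, fix_days=5),
    Bug("Showcase loses file", priority=2, hits_per_1000=300, fix_days=5),
]

if __name__ == "__main__":
    for bug in triage(bugs):
        print(bug.name, score(bug))
```

With these invented numbers the Showcase bug ranks first even though it
is only priority 2, because far more users run into it.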

A bunch of bugs can be eliminated by getting rid of features. Let's
have the courage to cut some of the fat.

WHAT TO DO -- LONG TERM

"Software quality is not a crime." -- Unknown

It's easy to go on forever here, but I'll try to limit it to a few key
ideas. We don't have to do all these at once, but we'd better start.

Have an overall SGI software plan.

Let's get an architect, or at least a small group of highly technical
people, not just managers, to agree on plans for releases. In fact,
since the release is a company-wide project, there ought to be
company-wide participation in the decisions of what's in a release.
The group should include marketing, documentation, engineering, and
management, and should come up with a compromise that's reasonable to
all.

In every case, some attempt must be made to check reasonableness all the
way to the bottom. There's a long series of excuses -- "Well, that's
what my junior VPs told me," or "That's what my directors/managers/lead
engineers/engineers told me." We get killed by the optimist effect and
a disinclination to listen seriously to anyone but our direct reports.
Try to imagine the guts it takes for an engineer to go to his director
and say: "My manager's out of his mind -- I can't possibly do what he's
promised."

Let's try to concentrate on performance and quality, not on new
features, especially for the 5.1.2 release. I know from my own
experience that when I write good code, I spend 10% of the time adding
features, and 90% debugging and tuning them. It's the only way to make
quality software. In SGI's recent releases, the opposite proportions
are often the rule. It's much easier to add 100 really neat features
that don't work than to speed up performance by 1%.

Aim for simplicity in design, not complexity. Make a few things work
really well; don't have 1000 flaky programs.

Be willing to cut features; who's going to be more pissed off: a
customer who was promised a feature that doesn't appear, or the same
customer who gets the promised feature, and after months of struggling
with it, discovers he can't make it work?

Get better agreement between the top level VPs and the lowest engineers
that a given schedule is reasonable.

For new development, continue the formal design reviews and code
walk-throughs. These shouldn't just happen once in the development
cycle -- things are bound to change, and code reviews can be very
valuable, even for our experienced programmers.



