This is the second article
from
Patrick Martin
who
is sharing with us his
experience as a
developer working on
problems related to
software quality. You
might find his first
article,
"Why one CPU is better
than two", interesting
as well.
There are
some basic
rules of
thumb which
will serve
you well in
testing any
application
that deals
with lists
of data (and
which
applications
don't?).
1."1; few;
many"
2."don't
make any
assumptions"
3."remember
to mix it up
a little"
This case
study covers
a very
interesting
example of
where
following
the rules of
thumb
*exactly*
paid
dividends.
The Problem
There was an
interactive reporting
solution that was having
performance issues:
essentially there was
some pathological
performance degradation
under some
circumstances.
The code contributing
the biggest slice of
time wastage was in a
3rd party component that
really could not be
There were multiple
passes at the problem
and the usual things
were done:
* Direct:
get the problem project
and simply pause the
debugger at the obvious
pain point
* Indirect: scrub the
code looking for
opportunities to
* (Eventually)
Validation through
testing: write an
automated script to
exercise the
reporting component
through a number of
configurable scenarios
These scrubs made a
palpable difference on
each iteration: for
reasons related to the
constraints of the 3rd
party component and the
application framework it
was hosted in,
traceability was not
great in the code and it
turned out to be
difficult to get the job
right first time.
There were significant
improvements made
however: minutes became
seconds in some cases
which gives an
indication of the
potential severity of
the issue for an
interactive application.
However, it turns out
even if the coding had
been got "right first
time" there was a
lurking issue that would
have caused a return to
the drawing board due to
a serious performance
issue.
The bombshell
In this case it's
worth going into a
little detail:
Developers and testers
have an awareness to
some level of the
importance of the
complexity of a given
process on the run time:
essentially they tend to
have an expectation that
(roughly) the run time
will go as O(n) and for
many apps there is a
strong pressure for n to
be a small number,
otherwise the user
experience can be very
disappointing. Many
defects are raised
around the issue that n
is not in fact a small
number. In this
instance, n seemed to be
an acceptable figure if
not great.
The developers and
testers got a big upset
when at the last moment
of the product release
cycle a project was
submitted that
completely defied the
performance as seen by
the developers and QA.
It took ages to perform
what was near instant on
much larger chunks of
data.
What was going on?
The developer
debugged the
application: the same
call sequence as any
other data; just a very
different time taken.
What was different in
the project?
Well, one of the default
limits had been busted
wide open from the
default value of several
hundred to several
thousand - there had
never been any testing
for this number of items
in these lists, but
still a few thousand is
a small number for
modern machines with
functions having an
acceptable order of n.
So, expectations were
being defied: what was
going on?
The automated test was
re-run with the limits
set up to the
default-busting value -
still performance was
acceptable and far
better than the problem
project.
It was now down to the
data in the project. The
automated test which
created data in a
controlled way could no
longer help.
Back to the developer.
Working furiously to
the very last days of
the development deadline
the answer suddenly
became apparent in
debugging and comparing
the behaviour of the
"good" big projects and
the "bad" big projects.
The behaviour of a key
function involved in the
interactive performance
of the component was not
simply characterised as
O(n). A better
indication would be O(n)
+ O(x-y), where x and y
are the counts of items
in the lists used by the
component.
Where x and y were
below the application
defaults the second term
was never noticed
through behaviour and
the 3rd party component
source was never
scrutinised to the level
where the flaw could
have been found. When
the application defaults
were massively exceeded
this O(x-y) had the
opportunity to become -
"quite significant".
Why didn't the
automated tests catch
this when the new larger
sizes were set in
configuration?
Because the tests
made a reasonable
assumption - that n
mattered and hence it
was not noticed by the
tester and developer
that the list sizes were
all the same - x-y was
always zero.
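To see why equal-sized lists hid the problem, a toy model of the
cost may help. This is only a sketch: the function name, the
weights per_item and per_gap, and the choice of n = x + y are all
invented for illustration, and the article's O(x-y) is read here
as the absolute difference between the two list counts. Nothing
in it is taken from the 3rd party component.

```python
def modelled_cost(x, y, per_item=1, per_gap=50):
    """Toy cost model: a linear term plus a term in the size gap.

    per_item and per_gap are made-up weights, not measurements.
    """
    n = x + y  # assumption: n is the total number of items
    return per_item * n + per_gap * abs(x - y)


# Equal-sized lists: the gap term vanishes however large they are.
equal = modelled_cost(5000, 5000)

# Same total number of items, but one long list and one short one.
skewed = modelled_cost(9990, 10)

print(equal, skewed)
```

In this model, scaling both lists up together only ever
exercises the first term, which is what the automated test was
doing.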
The problem project
had real data with large
and different list
counts - all it took was
a list to be several
thousand items long and for
another to be small for
the lurking performance
issue to be exposed.
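One way an automated test could "mix it up a little" is to
generate every combination of list sizes rather than scaling all
lists together. The sketch below is an assumption about how such
scenarios might be built, not the script used on the project: the
size classes, their values and the list names "x" and "y" are
hypothetical.

```python
from itertools import product

# Hypothetical size classes following "1; few; many".
SIZE_CLASSES = {"one": 1, "few": 10, "many": 5000}


def list_size_scenarios(list_names, size_classes=SIZE_CLASSES):
    """Yield one dict per scenario, mapping list name -> item count.

    Every combination of size classes is produced, so scenarios
    where the lists differ in length are covered alongside the
    equal-sized ones.
    """
    sizes = size_classes.values()
    for combo in product(sizes, repeat=len(list_names)):
        yield dict(zip(list_names, combo))


scenarios = list(list_size_scenarios(["x", "y"]))

# The scenarios an "all lists the same size" test would never run.
mixed = [s for s in scenarios if s["x"] != s["y"]]

for s in mixed:
    print(s)
```

The number of scenarios grows quickly with the number of lists,
so with many lists a sampled subset of the combinations may be
the more practical choice.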
The moral of the
story
It could be argued the
original bug report
contained the kernel of
the solution, by
providing an example of
rule of thumb #3.
Rule of thumb #1 was
initially thought to be
adequately covered but
fatally undermined by
the pathological
behaviour of the
application. The
application appeared to
be matching the testers'
and developers' internal
model for certain highly
specific conditions,
which were sadly
unrealistic in one key
feature: rule of thumb
#3.
Note: It is doubtful the
flaw in the 3rd party
component could have
been found through
desk-checking the source
in any realistic
timescale.