Want to read
the
introductory article
Guerrilla Testing Tips: the inside track from a developer's perspective
and biography of the author
from
Patrick Martin
|
|
This is a
very
interesting
real life
example of a
classic
concurrency
issue and
very neatly
illustrates
the abstract
process of
fault
finding a
race
condition.
What makes
it
particularly
interesting
is that
rather than
involving
high
performance
multi-threaded code
it is in an
automated
user
interface
test.
|
The Problem
A test
developer is in the process
of working on an automated
TestComplete test which is
part of a suite that is run
on a number of benchmark
virtual machines. He
discovers he has an
interesting problem: his
test fails quite regularly
with an odd error on his
machine but works on the
test virtual machines. In an
even more confusing
development, when he debugs
the test script in the test
tool IDE the bug does not
reproduce.
The Symptoms
The error is
"invalid operation" when the
fault reproduces and happens
in a routine that simply
invokes the close action on
all the forms contained in
an MDI application. The
routine is coded adequently
enough: if a form is present
the close action is invoked:
else nothing is done and no
error returned. It's a
simple fire and forget
function intended to get a
simple job done prior to
starting another test.
The Cause
Browsing in
the code before and after
the point of the error
reveals the close all forms
code is actually run twice:
once at the very end of a
function which has been
called as a sub-routine just
before the same procedure
attempts the same process.
The Detailed
Diagnosis
Recording,
running and debugging tests
using TestComplete
illustrates that tools like
TestComplete do a very good
job of making an inherently
very complex problem appear
very simple. There are
features in the engine that
mask the asynchronous nature
of the communication of
the simulated user actions
from the TestComplete engine
to the application under
test. This is only a
fiction, however and under
the right circumstances this
fiction can be exposed; this
is just such an occasion and
how this fiction was exposed
is revealing.
What is
happening is that the test
was exceuting the function
directing any open windows
to close, and then shortly
afterwards again running the
same code.
In
investigating the difference
between the various
scenarios is was found:
-
The test code being
executed was indeed
identical (always
worth checking)
-
The execution
environment of the
tests was identical
in version
-
The developer's
machine had 2 CPUs
and the virtual
machines had 1 CPU
among other less
obvious differences
Around this
point it became clear what
had been happening and it's
related to the way the OS
apportions time slices and
the time it takes for these
windows to close.
On a single
core machine time slices are
doled out to the many tasks
on the machine in a
pre-emptive manner in order
to achieve multi-tasking. Of
course this is going on with
the processes of
TestComplete and the
application under test; on
single core it's something
like 5 ms of TestComplete
then 5 ms of the
application, plus change for
vital tasks the operating
system has to perform like
passing messages between
applications. With so much
time (?! - really!) when an
application effectively has
the machine to itself it's
easy to see how an action
initiated by
TestComplete can be entirely
completed when the
application under test gets
its time slice. Although
there is a race condition in
the test script code, having
a single core "forces a
winner" which results in
correct operation for a
large proportion of the
time.
On multiple
cores the race condition can
be exposed and if in fact
highly likely to be exposed:
TestComplete goes through
all the open forms and
directs them to close; the
application under test,
having another core to run
on starts to do this
immediately; TestComplete
then repeats the order to
the forms to close; some of
the forms being in process
of closing down, this is an
error and is raised as such
in the test script.
The Cure
The cure
should be fairly obvious:
remove the superfluous code
that closes all windows; as
it turns out the problem
code was due to a copy and
paste when it should have
been a
cut
and paste.
The Moral of
the Story
There are
several:
-
when coding, use
source control to
review what you've
been doing and make
sure all the code
you've been moving
around has moved to
the right places and
also
removed from the
original places.
-
when debugging a
problem related to
concurrency there
are some watchwords
that developers are
very familiar with:
In this case
is seemed natural to assume
that what TestComplete
presents as happening as
being the exact case. Namely
the script directs the
application windows to close
and they close without
issue; in this view, what
could be incorrect in
performing the same
logically flawless operation
twice? If all the windows
closed, then closing zero
windows is wasteful but not
incorrect.
Since we know
the reality is not that
simple and the
window closing
process was initiated
and then it was attempted
again on windows that were
already closing it is easy
to see how it is incorrect
to tell a window to close
twice. At best it might
ignore the request, at worst
it may be literally unable
to respond to the request.
Hopefully this
simple example has
illustrated a vital issue in
the testing of complex
applications and the
development of test scripts
and in the future when a
similar issue crops up the
same checklist will
help uncover the root cause
quickly:
-
Discover what
parameters are
varied between the
situations
-
Examine what is
setting up the
conditions to
reproduce the fault
-
Desk-check the code
by working through
it logically to
confirm your
understanding
-
Don't panic: if you
are presented with a
working scenario and
a broken scenario
you are being
handed the
conditions for
success on a plate,
albeit a little
obliquely and with
some user assembly
required.
|