Problems with Excel
From: P.J.Wells@OPEN.AC.UK [mailto:P.J.Wells@OPEN.AC.UK]
Sent: Thursday, 8 March 2001 19:34
To: allstat@JISCMAIL.AC.UK
Subject: Summary: Excel for statistics
Last month I posted a request for information about the alleged deficiencies
of Excel for serious statistical work to the Allstat and Excel-G lists.
This is evidently a widespread concern, as I received 37 replies either
offering suggestions or asking to be kept informed. Unfortunately, pressure
of work has prevented me from compiling a full summary of the many helpful
replies until now.
My original interest in this was prompted by discussion on the RadStats
list, so I am posting this there too, as well as copying it to all those
who asked individually for a summary; apologies for the cross-posting
which will result. (As AllStat rejects attachments, please contact me
directly if you would like to receive the ones referred to below.)
I've tried to give appropriate credit for specific points, but as might be
expected a number of references were the subject of multiple independent
messages, so I hope that the following generic credit will do justice to:
Andrew Bertie
Jan Beyersmann
Jill Binker
Malcom Campbell
Dennis Chanter
John Cobby
Ronan Conroy
Nick Cox
Peter Das
Jeremy Dawson
Bjorn Donnis
Russell Ecob
Charles Fleischer
Tom Hawkins
Patrick Laycock
Rashid Nassar
R.Allan Reese
James Wallace
Jay Warner
Knut Wittkoski
Many thanks to you all.
Julian Wells
OU Business School
The Open University
Walton Hall
Milton Keynes
MK7 6AA
United Kingdom
+44 1908 654658
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1) Fundamental shortcomings of digital computation
These naturally apply to all programs, including high-end specialist
statistical ones; see B. D. McCullough's two-part article in The American
Statistician (11/98, Vol 52, part 4, and 5/99, Vol 53, part 2), available
in PDF at
http://www.amstat.org/publications/tas/mccull-1.pdf and
http://www.amstat.org/publications/tas/mccull.pdf
Part 1 discusses these basic problems of digital computing, including e.g.
truncation and rounding, and the use of standard data sets to test
statistical programs. Part 2 describes the application of the methodology
set out in Part 1 to SAS, SPSS and S-Plus. All three packages had
weaknesses in random number generation; S-Plus's correlations were
suspect, as were one-way ANOVA and non-linear least squares in SAS and
SPSS.
As a non-specialist, I found Part 1 especially chastening. For a simple
demonstration I devised the following (prompted by an Excel-G post on the
autofill feature); try:
Enter "100" in A1
Enter "=A1-0.1" in A2
Use the auto-fill facility to copy A2 down through (say) A110
Set Excel to display 13 places of decimals in the relevant cells.
How many of these cells would you want to use as the basis for any further
calculations which depended on the assumption that the contents were exact
multiples of 0.1?
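The same drift can be reproduced outside Excel; here is a minimal Python sketch of the exercise above (the loop plays the role of autofill, and nothing here is specific to Excel -- it is the binary representation of 0.1 that is at fault):

```python
# Sketch of the autofill exercise: repeatedly subtracting 0.1
# accumulates rounding error, because 0.1 has no exact binary
# floating-point representation.
values = [100.0]
for _ in range(109):                  # fill "A2" down through "A110"
    values.append(values[-1] - 0.1)

# The classic symptom of the same representation problem:
print(0.1 + 0.2 == 0.3)               # False

# Display the tail of the sequence to 13 decimal places, as in the
# Excel exercise; few entries are exact multiples of 0.1.
for v in values[-3:]:
    print(f"{v:.13f}")
```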
2) Special problems of Excel
(a) A similar approach to the above is applied to Excel in:
McCullough, B. D. and Wilson, B. (1999) On the accuracy of statistical
procedures in Microsoft Excel 97. Computational Statistics and Data
Analysis 31:27-37.
For those whose institutions have the necessary subscriptions, this is
available on-line at
http://www.elsevier.nl/gej-ng/10/15/38/37/25/27/article.pdf.
As yet, I have not had a chance to look at this myself, but one respondent
who has seen it described it as "not very complimentary!".
(b) Peter Das forwarded a number of messages from his archives, including
one very critical of Excel's random number generation: it suggests that
there is "no theoretical basis" for the generator used in Excel.
For those for whom this is an important issue, the same message also
identified the following URLs offering "genuinely" random numbers derived
from physical data:
http://www.fourmilab.ch/hotbits/ (radioactive decay)
http://www.random.org/ (atmospheric noise)
Unfortunately the most fun-sounding such service,
http://www.lavarand.sgi.com/, using a lava lamp (!), appears to be no
longer available.
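The forwarded message itself is not reproduced here, but as a rough illustration of the kind of empirical check one can apply to any generator's output, the following Python sketch runs a crude chi-square test of uniformity (Python's own generator stands in for Excel's; the bin count and sample size are arbitrary choices for illustration):

```python
# Crude chi-square uniformity check on a generator's output.
import random

random.seed(42)                       # fixed seed for reproducibility
n, bins = 100_000, 10
counts = [0] * bins
for _ in range(n):
    counts[int(random.random() * bins)] += 1

expected = n / bins
chi2 = sum((c - expected) ** 2 / expected for c in counts)
# For a sound generator chi2 should be of the order of bins - 1 = 9;
# a grossly larger value would cast doubt on uniformity.
print(chi2)
```

Such a test probes uniformity only; serial correlation and period length need separate tests, which is exactly why a generator with "no theoretical basis" is worrying even if its output looks uniform.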
(c) Knut M. Wittkowski is working on a (very impressive) animated
PowerPoint presentation on a number of issues; the current version is
available from
http://www.rucares.org/Course/Excel.ppv.
(d) Nick Cox pointed out the eccentric nature of Excel's approach to
quartiles in the following message, along with other comments: his last
sentence is a point worth bearing in mind.
++++++++++++++++++++++
...
Here is something at the elementary end. Asked to give the quartiles of
1(1)10 Excel gives 3.25 and 7.75. It is quite an achievement for
Excel to use a formula for quartiles unknown to statistics, because
at least half-a-dozen or so exist (sometimes people interpolate, sometimes
they like to do nothing more complicated than split the difference between
two numbers, etc.), but I believe that Excel's formula is not known in the
literature. When challenged on this, the burden of Microsoft's response
was apparently that this was to be compatible with other spreadsheets.
Bug-for-bug compatibility, in other words.
On allstat a few years ago, a weird example was mentioned. Enter a column
of zeros and set a cell to contain their sum. Now change one cell to O
(upper case O). Now change another cell to l (lower case l). What happens?
What would be reasonable behaviour for statistical purposes? Do you want
to trust this software?
As I understand it, Excel offers no facility for keeping a log or history
of what you did in a session. Being able to go back through the steps you
took, or that somebody else took, is vital for serious analysis. This is
especially because little worthwhile analysis can be done on the basis
of "plugging the numbers into a single formula" or "applying Foobar's
test".
It is not a fatal defect, but Excel graphics include several types which
I think unsound in principle (e.g. various 3-D effects which hinder
appreciation of the data). Some appear to be there largely for those users
who want something more bizarre than simple direct designs.
The main arguments for using Excel for statistics appear to be that many
people are using it already; have it on machines which are theirs or
accessible to them; are familiar with the style of computing it offers.
These can be powerful -- if not overwhelming -- considerations for some
groups of users in some circumstances, especially considering the cost of
statistical software and the fact that in many cases a command language
needs to be grappled with. But I have not yet heard, in years of asking,
any example of how Excel is intrinsically superior to statistical software
for any statistical purpose.
++++++++++++++++++++++
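For readers who want to see competing quartile definitions side by side, Python's standard library happens to implement two of them. On the data 1(1)10, the "inclusive" convention reproduces the 3.25 and 7.75 that the message above attributes to Excel, while the "exclusive" convention differs; this is only a sketch of how far the definitions can disagree, not a claim about what Excel computes internally:

```python
# Two common quartile definitions applied to the data 1, 2, ..., 10.
from statistics import quantiles

data = list(range(1, 11))
inc = quantiles(data, n=4, method="inclusive")   # interpolates over n-1 gaps
exc = quantiles(data, n=4, method="exclusive")   # interpolates over n+1 gaps
print(inc)   # [3.25, 5.5, 7.75] -- the values Excel reports
print(exc)   # [2.75, 5.5, 8.25] -- a different, equally respectable answer
```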
(e) Ronan Conroy sent the following example of Excel's dubious behaviour:
... a tiny dataset that made Excel give a totally meaningless regression
output.
x1 x2 x3 y
1 2 3 21
3 2 5 33
5 4 9 45
7 4 11 49
9 6 15 61
You do a model of y = x1 x2 x3
Excel gives you a load of old tosh.
Significant p-values for all x variables (despite the fact that x3 =
x1+x2) and an adjusted R^2 of 0.927, but an 'overall' p-value for the
regression of 0.171!
I wouldn't trust Excel's stats as far as I could throw them!
There was some discussion of this on edstat-l, and we compared the error
messages of various stats packages when faced with data like this. Some
were good and some were not, but Excel was the only package that went
ahead and calculated the nonsense regression requested.
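Ronan Conroy's dataset can be checked for this defect before any fitting is attempted. The sketch below is pure Python (no statistics package assumed); the rank computation is a textbook Gaussian elimination written out for illustration, not the routine any particular package uses:

```python
# Detect the exact collinearity (x3 = x1 + x2) in the dataset above.
x1 = [1, 3, 5, 7, 9]
x2 = [2, 2, 4, 4, 6]
x3 = [3, 5, 9, 11, 15]

print(all(a + b == c for a, b, c in zip(x1, x2, x3)))   # True

def rank(rows, eps=1e-9):
    """Rank of a matrix (given as a list of rows) by Gaussian elimination."""
    m = [list(map(float, row)) for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if abs(m[i][col]) > eps), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and abs(m[i][col]) > eps:
                f = m[i][col] / m[r][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

# Design matrix with an intercept column; a full-rank design would be 4.
X = [[1.0, a, b, c] for a, b, c in zip(x1, x2, x3)]
print(rank(X))   # 3: rank-deficient, so no unique least-squares solution
```

A package that checks this rank can refuse or warn; one that does not will go ahead and calculate a nonsense regression.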
3) Executive summaries
(a) One by Neil Cox of AgResearch in New Zealand, at
http://www.agresearch.cri.nz/Science/Statistics/exceluse.htm; this links
to the same author's more extended synopsis and commentary on McCullough
and Wilson's article at
http://www1.agresearch.co.nz:8000/Science/Statistics/exceluse1.htm.
(b) The attached summary (excelbug.doc) by an anonymous author was passed
on by Jeremy Dawson.
(c) Doug Chanter supplied another (attached as Exceleng.doc), which not
only documents problems with Excel's implementation of various
procedures, but makes the important point that the documentation is also
unreliable. In fact, the documentation is not even reliably unreliable --
sometimes it is wrong when the procedure is OK.
This point is perhaps of special relevance to those teaching statistics,
who should warn their students off using Excel's documentation as a
textbook.
4) Other resources
(a) Bjørn Donnis, among others, pointed out the Assume list (with
resources
website at
http://www.mailbase.ac.uk/lists/assume/files/welcome.html) many
useful links here, including to software add-ins (some of which are
freeware).
(b) On the subject of add-ins, James Wallace drew attention to the
offerings from NAG (http://www.nag.co.uk/). Those in UK educational
institutions will be able to obtain these under the CHEST arrangements.
It is also the case that S-Plus and SAS, and possibly other programs,
offer the facility to operate as plug-ins to Excel. For example, the
S-Plus plug-in allows you to operate on your Excel data and insert an
S-Plus chart into the Excel worksheet. From personal experience I have to
say that this seems a rather cumbersome way of going on, especially if
one's final output is going to be a report in Word.
(c) As may be imagined, there are many other lists where this issue has
been discussed, such as edstat and ap-stat.
(d) Finally, even an ideal program will produce junk if the underlying
data has been corrupted before analysis. The validation and security of
spreadsheets is thus a problem in its own right, as R. Allan Reese points
out.
EuSpRIG -- the European Spreadsheet Risks Interest Group -- has a
web-site at
http://www.gre.ac.uk/~cd02/eusprig/ containing some interesting material
and links.
On 19 Dec 2004, Ted.Harding@nessie.mcc.ac.uk wrote:
There is a huge literature on this topic, some of it published
in journals, much of it floating around on the web and in the
archives of mailing lists.
Tim's reference above is interesting, but only one example. The
McCullough and Wilson reference given there, though now somewhat dated,
identifies many of the classic problems.
Googling on
mccullough wilson excel
will throw up a host of followups.
Informed statistical comment on the problems of Excel encountered
by serious users can be found by browsing in the mailing list
ASSUME (Association of Statistics Specialists Using Microsoft Excel):
http://www.jiscmail.ac.uk/lists/assume.html.
The most recent serious issue reported there is the RAND() bug:
see ASSUME archives for Dec 2003 followed up in the March 2004
archives. The latter point to a statement from Microsoft:
SYMPTOMS
When you use the RAND function in Microsoft Office Excel 2003,
the RAND function may return negative numbers.
CAUSE
This problem may occur when you try to use many random numbers,
and you update the RAND function multiple times. For example,
this problem may occur when you update your Excel worksheet by
pressing F9 ten times or more.
RESOLUTION
This problem is fixed in the Microsoft Excel 2003 Hotfix Package
that is dated February 29, 2004.
http://support.microsoft.com/default.aspx?scid=kb;en-us;834520
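As an aside: Excel 2003's RAND was reportedly intended to implement the well-known Wichmann-Hill (1982) generator. Implemented as published (sketched below in Python), it returns values strictly in [0, 1) and cannot go negative, so the symptom Microsoft describes points to a faulty implementation rather than to the algorithm itself:

```python
# The Wichmann-Hill (1982) generator as published: three small
# linear congruential streams combined modulo 1.
def wichmann_hill(s1, s2, s3):
    """Yield uniform variates in [0, 1); seeds must be positive integers."""
    while True:
        s1 = (171 * s1) % 30269
        s2 = (172 * s2) % 30307
        s3 = (170 * s3) % 30323
        yield (s1 / 30269 + s2 / 30307 + s3 / 30323) % 1.0

gen = wichmann_hill(1, 2, 3)
sample = [next(gen) for _ in range(10_000)]
print(min(sample) >= 0.0 and max(sample) < 1.0)   # True: never negative
```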
(and the deeper you probe into this, the worse it gets).
While using Excel for statistics has some limited value in the context of
introducing statistics to students whose IT experience is limited to
exposure to courses on Excel and Word, where the teacher wants to build
on such experience, I think that Excel should never be used for serious
statistical work, for several reasons.
- The many reported (and some allegedly fixed) bugs in calculation and algorithms necessarily provoke suspicion that others still exist or may have been introduced. One simply cannot trust the results without checking.
- Too many things can be done silently and invisibly, "behind the spreadsheet", by Excel. Changes to data and differences between what is shown on the spreadsheet and what goes into exported files can arise without the user being aware of them. A particularly frightening example is the "sort" disaster reported to ASSUME (8 Dec 2003) by Allan Reese.
- Excel has some value as a straightforward data entry pad. However, I have seen far too many cases where sloppy usage has led to the resulting spreadsheet containing "information" which is either superfluous or wrong, in ways which would not be obvious to the user. For example, a "missing data" cell, if blank, may be interpreted as having value zero. Some people enter "." for missing data, but often are not consistent. If inadvertently a space is entered in a cell outside the intended row/column range of the data (or, I suspect, even if the spreadsheet cursor wanders outside the range), then when the sheet is exported (e.g. as "CSV") these extra rows and columns will be included. In one case I received an Excel spreadsheet with hundreds of such extra rows and dozens of extra columns, together with dozens of cases where " " and "." had been used inconsistently, all this over and over on each of about 6 "worksheet" pages; not to mention data in the wrong columns etc. It took about 4 days of continuous work to clean this up. To be frank, for entering complex data the discipline enforced by a properly designed Data Entry Form in a database package would avoid such problems altogether, and such a form should be used. The illusion of success that Excel gives the user is a most treacherous danger, and frankly I simply do not, in the first instance, trust data in a spreadsheet.
- The use of formulae in cells to generate cell values can cause all sorts of problems. One to especially watch out for is that a formula may have been wrongly or inappropriately "copied" from one column to another or from one worksheet to another. You can of course check this by moving the cell cursor to such cells and noting what the formula is, but as you can imagine this is a horribly unpleasant process (and by the way take care that you don't inadvertently alter it while you're doing this!). Also see Allan Reese's "sort" disaster above, which was formula-induced.
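The missing-data and phantom-row/column problems described above are exactly what one ends up cleaning by hand. Here is a small Python sketch of the defensive treatment; the file contents and the choice of missing-value markers are invented for illustration:

```python
# Defensive cleaning of a spreadsheet-style CSV export with a phantom
# trailing column, a phantom blank row, and inconsistent missing-data
# markers ("", " ", ".") -- all invented to mirror the problems above.
import csv, io

raw = (
    "id,height,weight,\n"    # trailing comma: a phantom 4th column
    "1,170,65,\n"
    "2,., ,\n"               # "." and " " both meant "missing"
    "3,182,77,\n"
    ",,,\n"                  # a phantom all-blank row
)

rows = []
for rec in csv.reader(io.StringIO(raw)):
    cells = [None if c.strip() in {"", "."} else c for c in rec]
    if all(v is None for v in cells):
        continue                         # drop all-blank phantom rows
    rows.append(cells)

# Drop columns that are empty in every surviving record:
keep = [j for j in range(len(rows[0])) if any(r[j] is not None for r in rows)]
rows = [[r[j] for j in keep] for r in rows]
print(rows)
```

A real cleanup should also log everything it drops or recodes, rather than discarding silently -- silent fixes are the very vice being complained of here.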
I could go on. I've written at length already because many readers of
R-help may be in situations where they necessarily receive data in Excel
files, or have to use Excel themselves, and may not yet have become aware
of the risks. So I'm writing as a warning to them: Don't trust Excel, but
if you must use it then check everything, make sure it's what it should
be, and make sure that it stays that way when the spreadsheet is accessed
(as in (3) or (4) above, things may change invisibly).