gnumeric
spreadsheet team corrected computational errors that Microsoft was unable to fix in Excel
: http://www.csdassn.org/software_reports/gnumeric.pdf
From: P.J.Wells@OPEN.AC.UK [mailto:P.J.Wells@OPEN.AC.UK] Sent: Donnerstag, 8. März 2001 19:34 To: allstat@JISCMAIL.AC.UK Subject: Summary: Excel for statistics
Last month I posted a request for information about the alleged deficiencies of Excel for serious statistical work to the Allstat and Excel-G lists.
This is evidently a widespread concern, as I received 37 replies either offering suggestions or asking to be kept informed. Unfortunately, pressure of work has prevented me from compiling a full summary of the many helpful replies until now.
My original interest in this was prompted by discussion on the RadStats list, so I am posting this there too, as well as copying it to all those who asked individually for a summary; apologies for the cross-posting which will result. (As AllStat rejects attachments, please contact me direct if you would like to receive the ones referred to below). I've tried to give appropriate credit for specific points, but as might be expected a number of references were the subject of multiple independent messages, so I hope that the following generic credit will do justice toAndrew Bertie Jan Beyersmann Jill Binker Malcom Campbell Dennis Chanter John Cobby Ronan Conroy Nick Cox Peter Das Jeremy Dawson Bjorn Donnis Russell Ecob Charles Fleischer Tom Hawkins Patrick Laycock Rashid Nassar R.Allan Reese James Wallace Jay Warner Knut Wittkoski Many thanks to you all. Julian Wells OU Business School The Open University Walton Hall Milton Keynes MK7 6AA United Kingdom +44 1908 654658 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1) Fundamental shortcomings of digital computation
Part 1 discusses these basic problems of digital computing, including e.g. truncation and rounding, and the use of standard data sets to test statistical programs. Part 2 describes the application of the methodology set out in Part 1 to SAS, SPSS and S-Plus. All three packages had weaknesses in random number generation, S-Plus's correlations were suspect, as are one-way ANOVA and non-linear least squares in SAS and SPSS.
As a non-specialist, I found Part 1 especially chastening. For a simple demonstration I devised the following (prompted by a Excel-G post on the autofill feature); try:Enter "100" in A1 Enter "=A1-0.1" in A2 Use the auto-fill facility to copy A2 down through (say) A110 Set Excel to display 13 places of decimals in the relevant cells.How many of these cells would you want to use as the basis for any further calculations which depended on the assumption that the contents were exact multiples of 0.1? 2) Special problems of Excel
McCullough, B. D. and Wilson, B. (1999) On the accuracy of statistical procedures in Microsoft Excel 97. Computational Statistics and Data Analysis 31:27-37.For those whose institutions have the necessary subscriptions, this is available on-line at http://www.elsevier.nl/gej-ng/10/15/38/37/25/27/article.pdf.
As yet, I have not had a chance to look at this myself, but one respondent who has seen it described it as "not very complimentary!".
(b) Peter Das forwarded a number of messages from his archives, including one very critical of Excel's random number generation: it suggests that there is "no theoretical basis" for the generator used in Excel.
For those for whom this is an important issue, the same message also identified the following URLs offering "genuinely" random numbers derived from physical data: http://www.fourmilab.ch/hotbits/ (radioactive decay) http://www.random.org/ (atmospheric noise) Unfortunately the most fun-sounding such service, http://www.lavarand.sgi.com/, using a lava lamp (!), appears to be no longer available. (c) Knut M. Wittkowski is working on a (very impressive) animated PowerPoint presentation on a number of issues, of which the current version is available from http://www.rucares.org/Course/Excel.ppv.(d) Nick Cox pointed out the eccentric nature of Excel's approach to quartiles in the following message, along with other comments: his last sentence is a point worth bearing in mind.
++++++++++++++++++++++
...
Here is something at the elementary end. Asked to give the quartiles of 1(1)10 Excel gives 3.25 and 7.75. It is quite an achievement for Excel to use a formula for quartiles unknown to statistics, because at least half-a-dozen or so exist (sometimes people interpolate, sometimes they like to do nothing more complicated than split the difference between two numbers, etc.), but I believe that Excel's formula is not known in the literature. When challenged on this, the burden of Microsoft's response was apparently that this was to be compatible with other spreadsheets. Bug-for- bug compatibility, in other words.
On allstat a few years ago, a weird example was mentioned. Enter a column of zeros and set a cell to contain their sum. Now change one cell to O (upper case O). Now change another cell to l (lower case l). What happens? What would be reasonable behaviour for statistical purposes? Do you want to trust this software?As I understand it, Excel offers no facility for keeping a log or history of what you did in a session. Being able to go back through the steps you took, or that somebody else took, is vital for serious analysis. This is especially because little worthwhile analysis can be done on the basis of "plugging the numbers into a single formula" or "applying Foobar's test".
It is not a fatal defect, but Excel graphics include several types which I think unsound in principle (e.g. various 3-D effects which hinder appreciation of the data). Some appear to be there largely for those users who want something more bizarre than simple direct designs.
The main arguments for using Excel for statistics appear to be that many people are using it already; have it on machines which are theirs or accessible to them; are familiar with the style of computing it offers. These can be powerful -- if not overwhelming -- considerations for some groups of users in some circumstances, especially considering the cost of statistical software and the fact that in many cases a command language needs to be grappled with. But I have not yet heard, in years of asking, any example of how Excel is intrinsically superior to statistical software for any statistical purpose.
++++++++++++++++++++++
(e) Ronan Conroy sent the following example of Excel's dubious behaviour:
x1 x2 x3 y 1 2 3 21 3 2 5 33 5 4 9 45 7 4 11 49 9 6 15 61You do a model of y = x1 x2 x3
Excel gives you a load of old tosh.
Significant p-values for all x variables (despite the fact that x3 = x1+x2) and an adjusted R^2 of 0.927, but an 'overall' p-value for the regression of 0.171!
I wouldn't trust Excel's stats as far as I could throw them!
There was some discussion of this on edstat-l, and we compared the error messages of various stats packages when faced with data like this. Some were good and some were not, but Excel was the only package that went ahead and calculated the nonsense regression requested.
(b) The attached summary (excelbug.doc) by an anonymous author was passed on by Jeremy Dawson
(c) Doug Chanter supplied another (attached as Exceleng.doc), which not only documents problems with Excel's implementation of various procedures, but makes the important point that the documentation is also unreliable.It fact, the documentation is not even reliably unreliable -- sometimes it is wrong when the procedure is OK.
This point is perhaps of special relevance to those teaching statistics, who should warn their students off using Excel's documentation as a textbook.
4) Other resourcesIt is also the case that S-Plus and SAS, and possibly other programs, offer the facility to operate as plug-ins to Excel. For example, the S-Plus plug-in allows you to operate on your Excel data and insert an S-Plus chart into the Excel worksheet. From personal experience I have to say that this seems a rather cumbersome way of going on, especially if one's final output is going to be a report in Word.
(c) As may be imagined, there are many other lists where this issue ahs been discussed, such as edstat and ap-stat
(d) Finally, even an ideal program will produce junk if the underlying data has been corrupted before analysis. The validation and security of spreadsheets is thus a problem in its own right, as R. Allan Reese points out.
EuSpRIG -- the European Spreadsheet Risks Interest Group -- has a web-site at http://www.gre.ac.uk/~cd02/eusprig/ containing some interesting material and links.There is a huge literature on this topic, some of it published in journals, much of it floating around on the web and in the archives of mailing lists.
Tim's reference above is interesting, but only one example. The McCullough and Wilson reference given there, though now somewhat dated, identifies many of the classic problems. Googling onmccullough wilson excel
will throw up a host of followups.
Informed statistical comment on the problems of Excel encountered
by serious users can be found by browsing in the mailing list
ASSUME (Association of Statistics Specialists Using Microsoft Excel): http://www.jiscmail.ac.uk/lists/assume.html. The most recent serious issue reported there is the RAND() bug:
see ASSUME archives for Dec 2003 followed up in the March 2004
archives. The latter point to a statement from Microsoft:
SYMPTOMS When you use the RAND function in Microsoft Office Excel 2003, the RAND function may return negative numbers. CAUSE This problem may occur when you try to use many random numbers, and you update the RAND function multiple times. For example, this problem may occur when you update your Excel worksheet by pressing F9 ten times or more. RESOLUTION This problem is fixed in the Microsoft Excel 2003 Hotfix Package that is dated February 29, 2004.http://support.microsoft.com/default.aspx?scid=kb;en-us;834520 (and the deeper you probe in this, the worse it gets). While using Excel for statistics has some limited value in the context of initiating to statistics students whose IT experience is limited to exposure to courses on Excel and Word, and the teacher wants to build on such experience, I think that Excel should never be used for serious statistical work, for several reasons.