Archive for the ‘python’ Category

SOFA Statistics and the “R is an Epic Fail” blog

Monday, April 26th, 2010

R is an open source programming language and software environment for statistics. And it is not just any old programming language – it is the dominant system for open source statistics. So was it fair to call R an “epic fail” as Dr. AnnMaria De Mars did in her notorious blog post The Next Big Thing?

Clearly R has been a massive success and it has a vibrant and lively community, many of whom were galvanised into making a response by the Epic Fail blog (see An article attacking R gets responses from the R blogosphere – some reflections on the phenomenon and R and the Next Big Thing as an example). So on what terms could it be considered a failure? For De Mars, successful software will be usable by the vast majority of people – not just programmers and others comfortable with command line interfaces.

… if you even LOOK at R code – bug-free or not, compilable or not – it should be evident that this is not how the average person uses a computer. If we are talking about something that is going to be used by a large number of people, R is not it (Comment by De Mars on her own blog post – The Next Big Thing).

… If your target market is “People who own cars that drive from point A to point B” that is much BIGGER than “people who work on engines”. If you are looking for a job making things or selling things or providing services, the former is more likely to pay off for you than the latter.
Telling people that if they can’t appreciate an internal combustion engine they are too stupid to own a car probably won’t help, either. (The Next Big Thing).

And in these terms, De Mars has a point. For many users, R needs a GUI. I like this quote tweeted by ravkalia (a big fan of R BTW): “Overheard at a computing meeting: ‘R is not a programming language, it’s a statistics package with the GUI missing.’” Of course there are various projects to provide a GUI for R but it can be argued there are limits to how far that can go given the inherent flexibility of R as an environment. Yihui Xie recently commented – “I prefer the command-line due to its flexibility. GUI cannot hold infinite components (buttons, drop-lists, check-boxes, …), whereas there are almost infinite possibilities in commands.” (r-is-an-epic-fail).

On her other points regarding R and data visualisation, and analysis of enormous quantities of unstructured data, De Mars is on shakier ground, but the observations about the mainstream preference for looking and clicking are valid.

So how does this relate to SOFA Statistics? SOFA stands for Statistics Open For All, which gives a strong hint as to where SOFA is aiming in terms of user interfaces and target audience. In practice this means:

  1. A simple GUI. This means trying hard to leave the right things out rather than adding in every possible option. Sometimes less is more. Think about your TV remotes.
    Interface chaos

    Some commentators have implied that a GUI is not important because the sorts of people who do statistics will also be comfortable with basic programming. But this is not always true. And lots more people, by several orders of magnitude, need to run basic statistical analyses than just specialist statisticians. Karen Grace-Martin put it especially well in her response to the Epic Fail post:

    “I primarily help researchers, mainly in biology and social science, apply statistics to their research. They are not doing “business analytics,” do not have enormous databases, and really have no need to program anything beyond what SAS or SPSS syntax does. They are not programmers or statisticians, and they don’t have backgrounds in programming or math.

    I believe they are the kinds of users of statistics that you are referring to and I agree with you wholeheartedly that they are probably the majority of statistics users and they have no need for a programming language. They don’t want to nor need to program new statistical procedures.

    There are clearly people who do, but I agree they’re not the majority. At least not in the fields I work.” (The Next Big Thing).

    Even full-time specialist statisticians may find it easier to use a simple GUI for basic data exploration e.g. generating simple frequency tables and cross tabs. It has been suggested that people should expect to use more than one package (SPSS, SAS, R, Stata, JMP? Choosing a Statistical Software Package or Two). SOFA Statistics may be a useful complement to R for many users.

    And ease of use should not be premised on the assumption that people will be heavy users of the package – or of statistics in general, for that matter. The program needs to make it easy to become productive in a hurry.

  2. High priority on aesthetics. Output needs to look attractive; beautiful if possible.
    Lucid spirals demo

    Even the program itself needs to look good:

    Form for selecting appropriate statistical test

  3. One True Way of Doing Things. It is not enough that there is a way of doing something – it can’t be buried somewhere obscure, and it has to clearly stand out as being correct and current (unlike some community technical advice).

    * In the Zen of Python (type import this into your Python interpreter) there is this gem: “There should be one– and preferably only one –obvious way to do it.”

  4. Helping the user when errors occur. Ideally there would never be any errors, but given that there are, it is important to make the error messages as useful as possible. This is an ongoing project in SOFA Statistics which is being given a high priority. Error messages are an important part of the interface and one of the most important to get right. The better the error messages, the less support people need and the happier they are (under the circumstances). Jon Peck commented on an unhelpful error message he receives from R:

    Here is an error message that I get a lot from a popular R package.
    ‘Error in optim(0, f, control = control, hessian = TRUE, method = "BFGS") :
    non-finite finite-difference value [1]‘
    I know what that means. Would an analyst?
    (Jon Peck – in response to The Next Big Thing)

  5. Not relying on users to stitch together everything they need. Ordinary users benefit if their application bundles together related output. This is a balancing act and one which we want to get right for the target user group for SOFA Statistics. The following quote captures the tradeoffs well:

    But one thing is clear to me: R aims at people who know what they are doing. Absolutely. You can see this with standard output in R which is very minimalistic. You must ASK R what you want from it. SAS and SPSS put everything out. And therefore you need to know how to program in R to use it, really. But if you do, you feel bound and limited with SAS or SPSS. (comment by mocianmomo in response to SAS v. R: Ease of learning).

  6. SOFA Statistics uses Python for scripting. Python is a language consciously designed to be easy to learn. Many statisticians find it a pleasure to work with Python but the same is not always true of the syntax of many statistics packages, especially those with lots of historical cruft.

    Example SOFA script in Python

SQL & integer division (why 5/2 usually equals 2!)

Monday, March 15th, 2010

I came across integer division in Python 2.x. If you divide one integer by another you get an integer result. So 5/2 = 2 instead of 2.5. You get floor division, not true division (Python – Changing the Division Operator). In Python 3, true division is the default (thank goodness) but in Python 2.x you need to make one of the numbers a float to get a float returned. So 5.0/2 = 2.5. I was bitten by this early on and know the standard way of handling it.
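A quick check of the two operators, written for Python 3 (where / is true division and // is floor division). The variable names are only for illustration:

```python
# Python 3: "/" is true division, "//" is floor division
true_result = 5 / 2        # 2.5
floor_result = 5 // 2      # 2
negative_floor = -5 // 2   # -3, floor division rounds towards negative infinity
float_floor = 5.0 // 2     # 2.0, flooring still applies when an operand is a float

print(true_result, floor_result, negative_floor, float_floor)
```

In Python 2.x the plain / between two integers behaves like // does here.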

What I didn’t know was that integer division is the norm in SQL SELECT statements. I had mainly been using MySQL, and MySQL turned out to be the exception:

MySQL by default does floating point division, even if both operators are of type INTEGER, so the above [1/2] would return 0.5 in MySQL. All of the other database engines tested do integer division, and return an integer result. (SQLite – Differences Between Engines).

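Anyway, in SOFA Statistics, row and column percentages were affected by this behaviour and always returned x.0 %. There was never anything other than zero after the decimal point. The fix was very simple: a float has to take part in the division itself. Instead of SELECT … 100*(num/denom) the relevant code needs to be along the lines of SELECT … (100.0*num)/denom. Writing 100.0*(num/denom) would not be enough, because the integer division inside the parentheses is evaluated first. The 100 is now a float, for those who missed that small but significant difference.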
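Where the float sits in the expression matters, and this is easy to check against SQLite using Python's built-in sqlite3 module. The sketch below is not SOFA's code; the table name counts and its columns are made up for the demonstration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Hypothetical table holding one numerator/denominator pair
cur.execute("CREATE TABLE counts (num INTEGER, denom INTEGER)")
cur.execute("INSERT INTO counts VALUES (1, 3)")

cur.execute(
    "SELECT 100*(num/denom), 100.0*(num/denom), 100.0*num/denom FROM counts"
)
int_pct, still_truncated, float_pct = cur.fetchone()
con.close()

# The first two expressions divide integers first, so the fraction is lost;
# only the third involves a float in the division itself
print(int_pct, still_truncated, float_pct)
```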

0.8.11 provides internationalisation support and a major fix for Vista/Windows 7

Monday, November 9th, 2009

The latest version of SOFA Statistics has some important improvements.

  • Fixed major bug preventing interaction with data on Vista/Windows 7. It was caused by the “\U” combination inside project configuration files (e.g. C:\Users\…). The backslash-U combination was treated as the start of a unicode escape sequence (used for international text etc.), and an invalid one at that. Testing on Windows XP didn’t pick this up because XP’s venerable “Documents and Settings” folder was replaced by the “Users” folder in Vista and Windows 7.
  • Better support for international text and unicode e.g. René, Identität, François etc.
  • Better responses to errors saving data to database tables. For example, when a user tries to save a word containing characters not supported by the underlying database table (such as a unicode letter not found in the Latin character set).
  • For Galician speakers, a version of SOFA Statistics in their own language (currently only working in Ubuntu).
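The “\U” problem in the first item can be reproduced in a few lines. This is only a sketch, assuming the configuration text is read as Python source; it is written for Python 3, where every string literal is unicode (in Python 2.x only u"…" literals are affected). The path and variable names are made up:

```python
# Source text as it might appear in a config file read as Python code:
#   path = "C:\Users\demo"
bad_source = r'path = "C:\Users\demo"'

try:
    compile(bad_source, "<project config>", "exec")
    error_message = None
except SyntaxError as exc:
    # "\U" starts an 8-hex-digit unicode escape, and "sers\dem" is not one
    error_message = str(exc)

# Two safe spellings of the same path
raw_path = r"C:\Users\demo"       # raw string: backslashes are left alone
escaped_path = "C:\\Users\\demo"  # doubled backslashes

print(error_message)
print(raw_path == escaped_path)
```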

There is also a new version of wxWebKit etc available for Karmic (9.10) users thanks to Christoph Willing. NB this will also help some users of Jaunty (9.04) who have updated packages which conflict with those in SOFA Statistics. More details can be found at http://www.sofastatistics.com/predeb.php.

Multi-language SOFA Statistics Begins

Saturday, October 24th, 2009

Launchpad offers great support for translating applications into different languages (https://help.launchpad.net/Translations). Python (http://docs.python.org/library/i18n.html) and wxPython (http://wiki.wxpython.org/Internationalization) both have standard ways of supporting multiple languages. So it was always going to be achievable to make SOFA Statistics multilingual as long as people were willing to help with translation. First to raise a hand has been Indalecio Freiría Santos (see SOFA Statistics discussion thread), so the Galician version should be available first. If you are interested in adding translations please feel free to raise your hand in the discussion group (http://groups.google.com/group/sofastatistics) at any time.

wxPython hourglass cursor not working in Ubuntu

Monday, August 17th, 2009

The following code worked in Windows but not in Ubuntu:

# Hourglass cursor
curs = wx.StockCursor(wx.CURSOR_WAIT)
self.SetCursor(curs)
# ... something happens that takes a while ...
# Return to normal cursor
curs = wx.StockCursor(wx.CURSOR_ARROW)
self.SetCursor(curs)

Use instead:

wx.BeginBusyCursor()
# ... something happens that takes a while ...
wx.EndBusyCursor()

NB good to use wx.IsBusy() with EndBusyCursor().  On Windows, ending a cursor if one is not running causes an error.

if wx.IsBusy():
    wx.EndBusyCursor()
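One way to make sure the busy cursor is always cleared, even if the slow step raises an exception, is to wrap the calls in a small context manager. This is a sketch rather than anything from SOFA itself; busy_cursor is a made-up helper, and it takes the imported wx module as an argument so that it has no import-time dependency on wxPython:

```python
from contextlib import contextmanager


@contextmanager
def busy_cursor(wx):
    """Show the busy cursor while the with-block runs.

    `wx` is the imported wxPython module (passed in rather than imported here).
    """
    wx.BeginBusyCursor()
    try:
        yield
    finally:
        # Only end the busy cursor if one is actually active
        if wx.IsBusy():
            wx.EndBusyCursor()


# Usage inside a wxPython application (hypothetical slow step):
#     import wx
#     with busy_cursor(wx):
#         run_slow_report()
```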

Misc library issues

Monday, August 17th, 2009

Re: pysqlite-2.5.5-win32-py2.6.exe – it wouldn’t install on my clean virtual XP environment.  It was unable to locate the component msvcr71.dll. So I was forced to include that in the Windows package.

The mysqldb module doesn’t currently have an official 2.6 version of the Windows installer. That was the main reason I had kept the Windows version on Python 2.5, for which there was one (SciPy was no longer relevant, so shifting to 2.6 for all installers was definitely in contention). There had also been mixed experience of mysqldb packages put together by third parties (https://sourceforge.net/forum/forum.php?thread_id=2316047&forum_id=70460). But I really needed a feature which was introduced in Python 2.6, namely the float method as_integer_ratio. It was needed for my float to decimal function (http://docs.python.org/library/decimal.html), which I needed in turn to get the level of precision required to pass the hardest NIST ANOVA test (http://www.itl.nist.gov/div898/strd/anova/SmLs09.html). In the end I went with http://www.thescotties.com/mysql-python/test/MySQL-python-1.2.3c1.win32-py2.6.exe. Another option was http://www.codegood.com/archives/4.
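To show why as_integer_ratio is useful here, the following is a rough sketch of a float to decimal conversion built on it. It is not the function used in SOFA Statistics; the name float_to_decimal and the default precision of 60 digits are assumptions chosen for the example.

```python
import decimal


def float_to_decimal(value, precision=60):
    """Convert a float to a Decimal holding the float's exact binary value."""
    # Every finite float is exactly numerator / denominator,
    # where the denominator is a power of two
    numerator, denominator = value.as_integer_ratio()
    with decimal.localcontext() as ctx:
        ctx.prec = precision  # enough digits for ordinary-sized values
        return decimal.Decimal(numerator) / decimal.Decimal(denominator)


print(float_to_decimal(0.5))
print(float_to_decimal(0.1))
```

The second value is not exactly 0.1, because the float 0.1 is itself only an approximation. In current Python versions decimal.Decimal(0.1) performs the same exact conversion directly.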

BTW there is a lot to like about Python 2.6 – it is the gateway to the 3 series and will make that eventual transition a lot easier.

The decimal module in Python

Wednesday, August 12th, 2009

Python has a brilliant decimal module (http://docs.python.org/library/decimal.html) you may need if you want to avoid floating point errors.  This may be necessary if you are faced with compounding errors under special circumstances e.g. if testing a statistical routine against a purpose-built test dataset (e.g. http://www.itl.nist.gov/div898/strd/anova/SmLs09_cv.html).  The performance hit is substantial, however, so it has to be used judiciously.  Anyway, here is an example:

# Python 2.x syntax (print statement)
import decimal
D = decimal.Decimal
decimal.getcontext().prec = 120  # work with 120 significant digits
d1 = D("1.1")
f1 = 1.1
print "Decimal result is: %s" % round((d1**1000 - D("2.46993291801e+41")),3)
print "Floating point result is: %s" % round((f1**1000 - 2.46993291801e+41),3)

Output:

Decimal result is: -4.17366587591e+29
Floating point result is: -3.97456123863e+29

Usually, floating point is good enough – but not under all circumstances.  In which case, it pays to be familiar with the decimal module.

Adding ability to import from csv and spreadsheets etc

Tuesday, June 9th, 2009

SOFA Statistics is having new import functionality added.  The first target is csv format files (using the standard Python csv module underneath) followed by Excel spreadsheets.  The solution I have for Excel works even when MS Office has not been installed on a machine but will only work in Windows.   Later on I will target SPSS data files and Open Office Calc spreadsheet files.
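As a taste of what the standard csv module provides, here is a small sketch. It is not the SOFA import code; the column names and values are invented, and an in-memory string stands in for a real file.

```python
import csv
import io

# Stand-in for an opened csv file; the column names and values are made up
sample = io.StringIO("id,age,group\n1,34,a\n2,51,b\n")

# DictReader takes the field names from the first row
reader = csv.DictReader(sample)
rows = list(reader)
fieldnames = reader.fieldnames

print(fieldnames)
# Every value arrives as a string, so an importer still has to work out types
print(rows[0]["age"])
```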

Resolving Windows installation glitches

Tuesday, May 26th, 2009

If you had problems installing the Windows version of SOFA 0.6.8, try 0.7.0 (http://www.sofastatistics.com/misc/sofa-0.7.0_python-2.5.zip).  It resolves the main issues with that installation package.  Version 0.7.0 also resolves some other issues within SOFA and represents the first of the 0.7 series – the goal of which is to enable importing of spreadsheets and other, non-SQL database type data.

The Windows comtypes package relied upon by SOFA 0.6.8 proved to be faulty.  The 0.7.0 version of SOFA Statistics, which has just been released, uses an older version of comtypes (0.5.2) which is known to work.  A version of comtypes 0.6.0 for Python 2.5 is apparently forthcoming.

There is still a delay when using SOFA’s table making functionality for the first time while comtypes generates some data it needs.  NB this is a one-off delay which doesn’t affect anything else.  Ideally, SOFA will handle this process better in a forthcoming release.

Windows package will be Python 2.5 only for now

Saturday, May 16th, 2009

Unfortunately SciPy does not have a Python 2.6 installer yet.  The MySQLdb package does but it is not from the central sourceforge location (http://www.technicalbard.com/files/MySQL-python-1.2.2.win32-py2.6.exe). Also see http://bytes.com/groups/python/854793-will-mysqldb-python-shim-supported-python-2-6-3-x.  For the time being, therefore, the Windows package for SOFA will be a Python 2.5 version only.

BTW nearly ready to release the packages once project hosting is finalised.

The licence is AGPL 3.