Archive for the ‘statistics’ Category

Importing tab-separated files and more

Saturday, March 15th, 2014

SOFA 1.4.3 now lets you import directly from tab-separated/tab-delimited files.

Importing tab-delimited text
Another change is less momentous but should really please people doing lots of row stats reports. As SOFA gained more measures it took more and more effort to select them one by one, checkbox by checkbox. Now there is a toggle button for Select All/Deselect All. Much better :-).

Select or Deselect All

And the bonus themes are now part of the standard release making it easier than ever to make your charts and tables look good – I hope you like them.

New Themes

One change that most users will never notice is better support for running SOFA via scripts. An exciting automation project is currently under development using this functionality and I hope to have some news to share soon.

Here’s the full list of changes:

  • Can import tab-delimited data.
  • More options for attractive charts and reports. Three new themes available – sky, prestige (screen), and prestige (print).
  • Better support for automation (i.e. headless, running without GUI) esp in international context.
  • Exporting to spreadsheet now relies on a more robust code library (xlwt).
  • Easy to select or deselect lots of row stats measures at once.
  • Faster opening in many cases.
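
For readers who have not worked with the format, here is a minimal sketch of reading tab-delimited data with Python's standard csv module. It only illustrates the file format; it is not SOFA's own import code, and the function name and sample data are made up for the example.

```python
import csv
import io


def read_tab_delimited(text_stream):
    """Return (field_names, rows) from a tab-delimited stream with a header row.

    Hypothetical helper for illustration only; not part of SOFA.
    """
    reader = csv.reader(text_stream, delimiter="\t")
    field_names = next(reader)  # first line holds the field names
    rows = [dict(zip(field_names, row)) for row in reader]
    return field_names, rows


# Made-up sample standing in for a real .tsv or .txt file
sample = io.StringIO("name\tage\tcountry\nAna\t34\tItaly\nBo\t27\tNZ\n")
field_names, rows = read_tab_delimited(sample)
print(field_names)
print(rows[0])
```

With a real file you would pass an open file object instead of the StringIO sample, e.g. `open(path, newline="", encoding="utf-8")`.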

And the bug fixes:

  • Minor tweak to PostgreSQL plug-in to handle timestamps without timezone.
  • Resolved bug when SQLite numbers are stored in a non-numeric field and processed for Chi Square test.
  • Importing csvs now copes better when only missing vals in sample of a field. Gives user the choice.
  • Fixed bug when doing a Row Stats table with a rows variable e.g. by Gender and some of the fields can’t be calculated for some of the row categories.
  • Headless importing now works in the event of inconsistent data types in fields.
  • Headless importing now reads entire dataset rather than a sample to avoid need for (human) decisions.
  • Scripts no longer rely on translated arguments. Much safer to use on other machines with different locales.
  • Fixed circular import bugs which only became visible when other bugs occurred.

Using SOFA alongside other statistics packages

Saturday, February 15th, 2014

All statistics packages have their strengths and weaknesses so it is not uncommon for people to want to use more than one – even on the same project. SOFA is focused on making it easy to use some core statistical tests and producing attractive, high definition tabular and charting output. SOFA also makes it easy to link to, or import from, a wide range of formats: xls, xlsx, csv, google docs spreadsheets, MS Access, MySQL, MS SQL Server, PostgreSQL, SQLite, and, more recently, CUBRID.

But there is no point overcomplicating SOFA so it can do every statistical test that might be needed for a particular project. SOFA users have been routinely surveyed on what features they would like added, and their responses have not consolidated into a clear list of priorities. People need lots of different things depending on their specific projects.

So a sensible goal for SOFA is to make it easy to import and export data, including metadata such as variable and value labels. This strategy has already resulted in the addition to SOFA (version 1.4.2) of built-in export-to-spreadsheet functionality. And it has already been improved for version 1.4.3 (not released yet at the time of this writing).

The question is, what packages should SOFA target as priorities for interoperability? Feel free to fire me off an email at

Confidence Intervals for ANOVA & t-tests in 1.3.3

Friday, April 5th, 2013

95% confidence intervals have now been added to ANOVA and t-tests. And associated output has right justified numbers to make it easier to read.

Confidence Intervals
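
As a rough illustration of what a 95% confidence interval for a group mean involves, here is a minimal sketch using only the Python standard library. It is not SOFA's code. The function name is hypothetical, and the critical value from the t distribution is passed in by the caller (2.262 is the two-sided 95% value for 9 degrees of freedom, i.e. a sample of 10) because the standard library has no t distribution.

```python
import math
import statistics


def mean_confidence_interval(sample, t_critical):
    """Return (lower, upper) bounds of a confidence interval for the mean.

    t_critical is the two-sided critical value of the t distribution for
    len(sample) - 1 degrees of freedom (looked up by the caller).
    Hypothetical helper for illustration only.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # sample SD / sqrt(n)
    margin = t_critical * std_err
    return mean - margin, mean + margin


# Made-up data: 10 observations, so 9 degrees of freedom
sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.1, 4.4, 5.2, 4.8, 5.5]
lower, upper = mean_confidence_interval(sample, t_critical=2.262)
print(f"95% CI for the mean: {lower:.2f} to {upper:.2f}")
```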

Version 1.3.3 also lets you sort by category labels in clustered bar charts, line charts, area charts and box plots. Area charts can also be sorted ascending or descending by count/mean/sum.

The series and category are now displayed in tooltips e.g. Italy, 20-29 for clustered bar charts, multi-series line charts, and box plots. This is especially helpful when there are lots of categories and/or series.

Boxplot Improvements

  • Improved statistics output footnotes.
  • Borders on bar-type charts are now optional. This can be useful when bars are very short.
  • Chi square clustered bar charts can cope with higher default limits for number of values.
  • Importing field names with more than 90 characters prohibited at the point of import rather than causing problems later.
  • The group by max number of values is now controlled by a single my_globals setting (making it easier to override).
  • The default settings for some remaining max values have been increased.

There was one minor bug-fix this version – line charts now cope better with lots of categories (increased padding around max label width in overall width calculations).

And a problem with the deb installer was also fixed.

Further improvements in 1.1.4

Friday, February 24th, 2012

The latest version adds a range of improvements:

  • Added lower and upper quartiles to Row Stats report tables.


  • Box plots now start y-axis from just below the minimum y value of the data unless the content is close enough to the bottom of the graph to make it worth using 0 anyway.
    Y axis adjusted automatically for box and whisker plots
  • Showing the percent sign in percent columns for report tables is now optional – which is good news for many dissertation students.

    Show (or hide) percentage symbols

  • SOFA now displays value labels sorted by the numerical version of numbers even if stored as text. So no more 1, 11, 2, 3 etc. in cases where people have stored the number as a Text data type.
  • Added some more valid US date formats using dot dividers.
  • New help button for importing data.
  • New help button to advise on how to make use of flexible data filters.
  • English translations are handled better (no more messages about not having US English and using UK English instead etc).
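
The value label sorting change in the list above is easy to picture with a small example. This is a minimal sketch of the general idea in Python, not SOFA's implementation; the key function name is hypothetical.

```python
def numeric_first_key(value):
    """Sort key: numbers stored as text sort by numeric value; other text follows.

    Hypothetical helper for illustration only.
    """
    try:
        return (0, float(value), "")
    except ValueError:
        return (1, 0.0, value)  # non-numeric text sorts after numbers, alphabetically


stored_as_text = ["1", "11", "2", "3", "10"]
print(sorted(stored_as_text))                         # plain text ordering
print(sorted(stored_as_text, key=numeric_first_key))  # numeric ordering
```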

Plus there are some useful bug fixes:

  • Fixed bug where getting observed values e.g. for chi square test, fell over when one field in pair had missing values while the other didn’t.
  • Fixed bug in calculation of upper and lower whiskers in box plots.
  • Single bar charts don’t show a bar title anymore – only needed if multichart.
  • Fixed bug which only changed variable definitions when the extra settings dialog was closed with OK and didn’t ever set it otherwise e.g. when changing the selected project.
  • Now copes with newer versions of matplotlib on Linux.
  • No longer stores empty strings as variable labels if user doesn’t enter a label.

New tutorial videos on SOFA Statistics

Sunday, January 15th, 2012

Check out these two new tutorial videos for SOFA Statistics:

0.9.13 Recode data e.g. age to age group; better stats support

Saturday, July 10th, 2010

SOFA Statistics 0.9.13 has a number of exciting new features.

1) Easy recoding of data e.g. age to age group:

Recoding data e.g. age to age group
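
For anyone unfamiliar with recoding, the idea is simply mapping ranges of one variable onto categories of a new one. Here is a minimal sketch in plain Python. The age bands and the function name are made up for the example and say nothing about how SOFA stores or applies its recoding rules.

```python
def recode_age(age):
    """Map an age in years to an age-group label.

    The bands below are hypothetical; the final return acts as a catch-all
    for every remaining value.
    """
    if age is None:
        return None  # keep missing values missing
    if age < 20:
        return "Under 20"
    if age < 30:
        return "20-29"
    if age < 40:
        return "30-39"
    return "40+"


ages = [18, 25, 39, 40, None, 67]
print([recode_age(age) for age in ages])
```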

2) Better support when choosing statistical tests:

When a selection is made SOFA displays some helpful tips to affirm the choice made or suggest alternatives.

Helpful tips

The dialog also makes it easy to answer questions about your data which will help make a decision.

Answering questions about your data

3) Better support for importing CSV files with different data encodings:

Confirm encoding when importing csv files
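
To show why the encoding matters, here is a minimal sketch of decoding CSV bytes by trying a list of candidate encodings in order. It is an illustration in standard-library Python, not SOFA's import code; the function name and the default candidate list are assumptions made for the example.

```python
import csv
import io


def decode_csv_bytes(raw_bytes, candidate_encodings=("utf-8", "big5", "latin-1")):
    """Return (encoding, rows) using the first candidate that decodes cleanly.

    Hypothetical helper for illustration only. Note that latin-1 can decode
    any byte sequence, so with the default list it acts as a last resort.
    """
    for encoding in candidate_encodings:
        try:
            text = raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong guess; try the next candidate
        rows = list(csv.reader(io.StringIO(text, newline="")))
        return encoding, rows
    raise ValueError("None of the candidate encodings could decode the data")


# Made-up sample: a CSV saved in Chinese Traditional (big5)
raw = "小明,30\n".encode("big5")
encoding, rows = decode_csv_bytes(raw)
print(encoding, rows)
```

A clean decode does not prove the guess was right, since the wrong encoding can still produce readable-looking garbage. That is why asking the user to confirm the encoding is a sensible step.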

4) Nicer Windows installer

New Windows installer

5) More flexible installation options. Anything containing ‘sofa’ is OK.

There have also been some important bug fixes:

  • Recoding copes with REMAINING keyword properly and copes with varied field types properly.
  • Generally copes better with system encodings like Chinese Traditional (big5).
  • CSV importing handles non-English much better.
  • Corrected handling of non-English characters when errors are encountered.

SOFA Statistics growing in popularity

Wednesday, June 30th, 2010

June 2010 was easily the best month for SOFA Statistics downloads with 2,296 recorded. This is very encouraging for the project and suggests there is demand for a statistics package prioritising ease of use and aesthetics.

Sourceforge downloads for June 2010

The priority in coming releases is the addition of output charting. Some extra work will also go into the Statistical Test Selection Dialog to ensure the correct balance is struck between providing simple advice to beginners and recognising the multitude of factors which may be relevant to making a decision.

As always, users are encouraged to provide feedback on what they like/don’t like about the program. The goal is to make the best application possible so all feedback is welcome.

SOFA Statistics and the “R is an Epic Fail” blog

Monday, April 26th, 2010

R is an open source programming language and software environment for statistics. And it is not just any old programming language – it is the dominant system for open source statistics. So was it fair to call R an “epic fail” as Dr. AnnMaria De Mars did in her notorious blog post The Next Big Thing?

Clearly R has been a massive success and it has a vibrant and lively community, many of whom were galvanised into making a response by the Epic Fail blog (see An article attacking R gets responses from the R blogosphere – some reflections on the phenomenon and R and the Next Big Thing as an example). So on what terms could it be considered a failure? For De Mars, successful software will be usable by the vast majority of people – not just programmers and others comfortable with command line interfaces.

… if you even LOOK at R code – bug-free or not, compilable or not – it should be evident that this is not how the average person uses a computer. If we are talking about something that is going to be used by a large number of people, R is not it (Comment by De Mars on her own blog post – The Next Big Thing).

… If your target market is “People who own cars that drive from point A to point B” that is much BIGGER than “people who work on engines”. If you are looking for a job making things or selling things or providing services, the former is more likely to pay off for you than the latter.
Telling people that if they can’t appreciate an internal combustion engine they are too stupid to own a car probably won’t help, either. (The Next Big Thing).

And in these terms, De Mars has a point. For many users, R needs a GUI. I like this quote tweeted by ravkalia (a big fan of R BTW): “Overheard at a computing meeting: ‘R is not a programming language, it’s a statistics package with the GUI missing.’” Of course there are various projects to provide a GUI for R but it can be argued there are limits to how far that can go given the inherent flexibility of R as an environment. Yihui Xie recently commented – “I prefer the command-line due to its flexibility. GUI cannot hold infinite components (buttons, drop-lists, check-boxes, …), whereas there are almost infinite possibilities in commands.” (r-is-an-epic-fail).

On her other points regarding R and data visualisation, and analysis of enormous quantities of unstructured data, De Mars is on shakier ground, but the observations about the mainstream preference for looking and clicking are valid.

So how does this relate to SOFA Statistics? SOFA stands for Statistics Open For All, which gives a strong hint as to where SOFA is aiming in terms of user interfaces and target audience. In practice this means:

  1. A simple GUI. In practice, this means trying hard to leave the right things out rather than adding in every possible option. Sometimes less is more. Think about your TV remotes.
    Interface chaos

    Some commentators have implied that a GUI is not important because the sorts of people who do statistics will also be comfortable with basic programming. But this is not always true. And lots more people, by several orders of magnitude, need to run basic statistical analyses than just specialist statisticians. Karen Grace-Martin put it especially well in her response to the Epic Fail post:

    “I primarily help researchers, mainly in biology and social science, apply statistics to their research. They are not doing “business analytics,” do not have enormous databases, and really have no need to program anything beyond what SAS or SPSS syntax does. They are not programmers or statisticians, and they don’t have backgrounds in programming or math.

    I believe they are the kinds of users of statistics that you are referring to and I agree with you wholeheartedly that they are probably the majority of statistics users and they have no need for a programming language. They don’t want to nor need to program new statistical procedures.

    There are clearly people who do, but I agree they’re not the majority. At least not in the fields I work.” (The Next Big Thing).

    Even full-time specialist statisticians may find it easier to use a simple GUI for basic data exploration e.g. generating simple frequency tables and cross tabs. It has been suggested that people should expect to use more than one package (see SPSS, SAS, R, Stata, JMP? Choosing a Statistical Software Package or Two). SOFA Statistics may be a useful complement to R for many users.

    And ease of use should not be premised on the assumption that people will be heavy users of the package – or of statistics in general, for that matter. The program needs to make it easy to become productive in a hurry.

  2. High priority on aesthetics. Output needs to look attractive; beautiful if possible.
    Lucid spirals demo

    Even the program itself needs to look good:

    Form for selecting appropriate statistical test

  3. One True Way of Doing Things. It is not enough that there is a way of doing something – it can’t be buried somewhere obscure, and it has to clearly stand out as being correct and current (unlike some community technical advice).

    * In the Zen of Python (type import this into your Python interpreter) there is this gem: “There should be one– and preferably only one –obvious way to do it.”

  4. Helping the user when errors occur. Ideally, there would never be any errors but given that there are, it is important to make the error messages as useful as possible. This is an ongoing project in SOFA Statistics which is being given a high priority. Error messages are an important part of the interface and one of the most important to get right. The better the error messages, the less support people need and the happier they are (under the circumstances). Jon Peck commented on an unhelpful error message he receives from R:

    Here is an error message that I get a lot from a popular R package.
    ‘Error in optim(0, f, control = control, hessian = TRUE, method = "BFGS") :
    non-finite finite-difference value [1]’
    I know what that means. Would an analyst?
    (Jon Peck – in response to The Next Big Thing)

  5. Not relying on users to stitch together everything they need. Ordinary users benefit if their application bundles together related output. This is a balancing act and one which we want to get right for the target user group for SOFA Statistics. The following quote captures the tradeoffs well:

    But one thing is clear to me: R aims at people who know what they are doing. Absolutely. You can see this with standard output in R which is very minimalistic. You must ASK R what you want from it. SAS and SPSS put everything out. And therefore you need to know how to program in R to use it, really. But if you do, you feel bound and limited with SAS or SPSS. (comment by mocianmomo in response to SAS v. R: Ease of learning).

  6. SOFA Statistics uses Python for Scripting. Python is a language consciously designed to be easy to learn. Many statisticians find it a pleasure to work with Python but the same is not always true of the syntax of many statistics packages, especially those with lots of historical cruft.
    Example SOFA script in Python