Check out these two new tutorial videos for SOFA Statistics:
New tutorial videos on SOFA Statistics
January 15th, 2012
Nice feature for dissertation students
January 13th, 2012
A helpful user drew my attention to the desirability of adding a small but important feature for dissertation students – namely, the ability to leave the percentage symbol off the numbers in the percentage columns of frequency and cross tabulation report tables. This new feature will be in the forthcoming version of SOFA (1.1.4):

Here is the feedback from Doug:
For dissertation writing in the States, Turabian 7th edition and the Chicago Manual of Style 6th edition are standard for many graduate schools on both the masters and the doctoral level. In both cases, tables with percentage figures in them do NOT have percent signs in front of the numbers themselves, because a typical title like “Table 3. % of babies born to men over 40” already tells you what’s inside the table.
Sofa Stats, however, so far as I can see, requires that percentages have the percent sign, which then gets dragged-and-dropped into Word (or Excel, for tidying up first). If the table is a small one, and if there are only one or two, no problem. But many dissertations have tons of them.
There is a way to rid a table of the % signs by using Excel, but it’s awkward and not a part of the regular menu system. I just worked it out myself a few hours ago, after spending half a day on the problem.
What would be *extremely* helpful to graduate students, whom I assume you would like to have as one of your key user groups, would be for you to program in a “switch” that would allow the user to specify percentages with or without percent signs. It’s a small detail, but one that would be much appreciated.
I generally try to avoid adding more features to SOFA, in favour of keeping it simple, but this seemed a good idea. Thanks again for the feedback, Doug.
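The kind of switch Doug describes can be sketched in a few lines. This is only an illustration of the idea, not SOFA's actual code: the function name `format_pct` and the `show_symbol` and `decimals` parameters are made up for this example.

```python
def format_pct(value, show_symbol=True, decimals=1):
    """Format a percentage, optionally leaving off the trailing % sign.

    Hypothetical helper for illustration only; not taken from SOFA.
    """
    text = "{:.{dp}f}".format(value, dp=decimals)
    return text + "%" if show_symbol else text


print(format_pct(45.678))                     # with the symbol
print(format_pct(45.678, show_symbol=False))  # without, for style guides that forbid it
```

A report table would then pass the user's preference through to every percentage cell rather than stripping symbols afterwards.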
Honey I Shrunk the Installers
December 19th, 2011
The SOFA installers for Windows and Mac have shrunk substantially – from 43MB to 25MB for Windows and from a rather hefty 85MB to 36MB for Mac. They’ll be quicker to download, and because the new installers are self-contained they also avoid possible conflicts with other Python packages on a system. A final benefit is that the installation process itself has become much simpler, with far fewer steps. For those who are technically minded, it is thanks to PyInstaller and py2app (with some initial help from GUI2EXE).
Mainstream German Computer Magazine Reviews SOFA
December 18th, 2011
SOFA has been reviewed and included in the software CD for a recent edition of Germany’s c’t magazine (c’t 2011 Issue 26 p.118). C’t (Magazin für Computertechnik) has a sold circulation of about 367,000 so it was wonderful to show up on their radar.
Making a better installer for SOFA using PyInstaller
December 9th, 2011
As SOFA Statistics has gained more functionality it has grown in complexity – there are modules for reading Excel spreadsheets, connecting to Google Docs spreadsheets, displaying charts, displaying GUI widgets etc. Trying to make a single executable for Windows users was always going to be a challenge and would probably involve a lot of trial and error. So it proved.
But there was one technique I used to make the seemingly impossible task manageable. I made a single Python script called launch.py which was responsible for importing all the main modules the executable would need to handle (e.g. matplotlib, MySQLdb etc). I identified the imports I would need by looking at each and every main module in SOFA and adding any external library module imports not already included.
The process of making an executable failed initially, but by commenting and uncommenting parts of the launch script I was able to isolate the problem modules and fix them. To get PostgreSQL working, for example, I needed to add the following fix:
import os  # already imported near the top of launch.py

try:
    # I needed to add the Postgres library directory to the PATH
    # variable in Windows. Apparently when Postgres is installed under Windows as a
    # service, this isn't done automatically (no need to) so that library isn't
    # available. [http://osdir.com/ml/python.db.pygresql/2008-03/msg00021.html]
    # OK to hardwire to version available to my installer dev environment. The user experience
    # will depend on whether they have set the PATH properly.
    os.environ['PATH'] += ";C:\\Program Files\\PostgreSQL\\9.1\\bin"
    import pgdb
except ImportError:
    pass
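The comment-and-uncomment hunt for problem modules could also be partly automated. The sketch below is not part of SOFA or launch.py. It is a hypothetical helper (`find_failing_imports` is a name invented here) that tries each import in turn and reports the ones that fail, written for current Python rather than the Python 2 used at the time.

```python
import importlib


def find_failing_imports(module_names):
    """Try each import in turn and collect the ones that fail.

    Hypothetical helper for illustration; not part of launch.py.
    """
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # a broken build can raise more than ImportError
            failures[name] = repr(exc)
    return failures


# The last name is deliberately bogus to show what a failure looks like.
for name, error in find_failing_imports(["csv", "json", "not_a_real_module_xyz"]).items():
    print(name, "->", error)
```

Run from the same environment used to build the executable, this narrows down which imports need special handling before any build is attempted.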
Here is the full text of launch.py:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from __future__ import division # so 5/2 = 2.5 not 2 !
from __future__ import print_function
# remove import __future__ from dbe_sqlite
import cgi
import codecs
from collections import defaultdict
from collections import namedtuple
import copy
import csv
import datetime
import decimal
import gettext
import glob
import locale
import math
from operator import itemgetter
import os
import platform
import pprint
import random
import re
import shutil
import socket
import subprocess
import sys
import time
import traceback
from types import IntType, FloatType, ListType, TupleType, StringType
import warnings
import weakref
import webbrowser
import xml.etree.ElementTree as etree
import zipfile
# Even though not used here pyinstaller won't know about it otherwise
# and will not have it when encountered in import2run.py/start.py etc
import MySQLdb as mysql
try:
    # I needed to add the Postgres library directory to the PATH
    # variable in Windows. Apparently when Postgres is installed under Windows as a
    # service, this isn't done automatically (no need to) so that library isn't
    # available. [http://osdir.com/ml/python.db.pygresql/2008-03/msg00021.html]
    # OK to hardwire to version available to my installer dev environment. The user experience
    # will depend on whether they have set the PATH properly.
    os.environ['PATH'] += ";C:\\Program Files\\PostgreSQL\\9.1\\bin"
    import pgdb
except ImportError:
    pass
import sqlite3 as sqlite # using sqlite3.dll from Python 2.7 so includes foreign key support
#import wxversion
#wxversion.select("2.8") # Not needed when using executable.
# http://groups.google.com/group/pyinstaller/browse_thread/thread/1b57e64ddc35e772
if not hasattr(sys, 'frozen'):
    import wxversion
    wxversion.select('2.8')
import wx
import wx.lib.iewin as ie
import wx.gizmos
import wx.grid
import wx.html
try:
    from agw import hyperlink as hl
except ImportError: # if it's not there locally, try the wxPython lib.
    import wx.lib.agw.hyperlink as hl
# problem locating eggs folder - solution in http://www.pyinstaller.org/ticket/185
# change pyinstaller-1.5\support\_pyi_egg_install.py
#if os.path.isdir(d):
#    for fn in os.listdir(d):
#        sys.path.append(os.path.join(d, fn))
import numpy as np
#if hasattr(sys, 'frozen') and sys.frozen:
#    import numpy.core.ma
#    sys.modules['numpy.ma'] = sys.modules['numpy.core.ma']
# if include matplotlib before sys.path, matplotlib.collections shadows collections and won't find namedtuple
# Currently problem with Path in environment MATPLOTLIBDATA not a directory
# Must put mpl-data folder in same folder as the executable is finally run from
import matplotlib
#import matplotlib.numerix as Numerix
#from matplotlib.axes import _process_plot_var_args
#from matplotlib.backend_bases import FigureCanvasBase
#from matplotlib.backends.backend_agg import FigureCanvasAgg, RendererAgg
#from matplotlib.backends.backend_wxagg import FigureCanvasWxAgg
#from matplotlib.figure import Figure
#from matplotlib.font_manager import FontProperties
#from matplotlib.projections.polar import PolarAxes
#from matplotlib.transforms import Bbox
# connected to matplotlib
# don't exclude Tkinter, Tkconstants
import wxmpl
import pylab # must import after wxmpl so matplotlib.use() is always first
# don't import boomslang - trouble with import pylab in many cases, even import math.
# works fine if matplotlib baked into exe
#import boomslang
# no need to bake googleapi in as nothing installed as such. Just ensure not using stale pycs from Ubuntu system.
#import googleapi
# problem with import os etc if using below
#import googleapi.gdata.spreadsheet.service as gdata_spreadsheet_service
#import googleapi.gdata.spreadsheet as gdata_spreadsheet
#import googleapi.gdata.docs.service as gdata_docs_service
#import googleapi.gdata.service as gdata_service
# no need to bake xlrd in as nothing installed as such. Just ensure not using stale pycs from Ubuntu system.
#import xlrd
import adodbapi
import pywintypes
import win32api
import win32con
import win32com
import win32com.client
import dao36_from_genpy # go to makepy/genpy and look in py files till found - taken and rename and relocate so can directly call
import import2run
The code for SOFA is cross-platform and I start the Windows packaging process by copying everything across from Ubuntu. It is important in such a case to wipe all pyc files so that platform-specific ones are created for Windows and included in the executable creation process.
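Wiping the pyc files is easy to script. The following is a minimal sketch, not the script I actually used; the function name `remove_pyc_files` is made up for this example.

```python
import os


def remove_pyc_files(root):
    """Delete every .pyc file under root and return the paths removed."""
    removed = []
    for folder, _subfolders, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(".pyc"):
                path = os.path.join(folder, filename)
                os.remove(path)
                removed.append(path)
    return removed


# Example call (path assumed from the spec file shown later in this post):
# remove_pyc_files(r"C:\sofastats_build_exe\sofa.main")
```

Returning the list of removed paths makes it easy to confirm that nothing stale from Ubuntu was left behind.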
The final import statement is for import2run.py. This means that the executable doesn’t hardwire anything beyond the imports. As it happens I started by having import2run contain just the following line:
raw_input("Success!!")
Later, once all the basic imports were working, I changed it to:
import start
to actually load SOFA. NB the executable created using the technique described here doesn’t replace all the SOFA modules with a single executable – its purpose is to replace Python and all the extra libraries such as matplotlib. So the exe is expected to live in the main SOFA program folder (usually in C:\Program Files\sofastats) alongside the usual modules such as core_stats.py. If a user actually had Python 2.6 and all the libraries installed they could either use the exe or run start.py directly themselves. It would have the same effect.
Getting matplotlib to work took a while and involved many false leads. In the end the solution was to copy the entire mpl-data folder (from somewhere like C:\Python26\Lib\site-packages\matplotlib) into the same folder as the sofastats.exe was going to end up.
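The copy step itself can be scripted as part of the build. This is a sketch under assumptions rather than my actual build script: `copy_data_folder` is a hypothetical helper, and the paths in the usage comment are only the examples mentioned in this post.

```python
import os
import shutil


def copy_data_folder(src_folder, exe_folder):
    """Copy src_folder (e.g. matplotlib's mpl-data) into exe_folder, keeping its name."""
    folder_name = os.path.basename(os.path.normpath(src_folder))
    dest = os.path.join(exe_folder, folder_name)
    if os.path.isdir(dest):
        shutil.rmtree(dest)  # replace any stale copy from an earlier build
    shutil.copytree(src_folder, dest)
    return dest


# Example call (paths are assumptions based on the folders named in this post):
# copy_data_folder(r"C:\Python26\Lib\site-packages\matplotlib\mpl-data",
#                  r"C:\sofastats_build_exe\sofa.main")
```

Removing any earlier copy first means a rebuild never mixes data files from two matplotlib versions.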
Some final things I learned about PyInstaller. --onedir is the default and adds the coll = COLLECT(…) part of the spec file. If you are making manual changes and want the onedir approach, don’t include a.binaries in the EXE(…) part, and set exclude_binaries to True. If, like me, you want a single executable file, don’t bother with coll = COLLECT(…), include a.binaries in EXE(…), and set exclude_binaries to False. While testing, set debug=True and console=True so you can see what is going wrong as you refine your spec file, launch.py script etc.
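As a memory aid, the rules above can be written down as a small lookup. This is plain Python for illustration only: `spec_settings` is not a PyInstaller function, and the dictionary keys are just labels I have invented for the choices described.

```python
def spec_settings(onefile, testing):
    """Summarise the spec-file choices described above as a plain dict.

    Hypothetical helper; the keys are descriptive labels, not PyInstaller API.
    """
    return {
        # onefile: binaries go into EXE(...); onedir: they are left to COLLECT(...)
        "a.binaries in EXE": onefile,
        "exclude_binaries": not onefile,
        "needs COLLECT": not onefile,
        # verbose builds with a console window while refining the spec
        "debug": testing,
        "console": testing,
    }


print(spec_settings(onefile=True, testing=True))
```

The single-file testing configuration printed here corresponds to the settings in the spec file shown later in this post.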
Although GUI2EXE is a wonderful program, some aspects may not be compatible with PyInstaller 1.5.1, so I now build my spec file using makespec.py with the --onefile argument. It works in its basic vanilla form for SOFA using launch.py. You can export the spec file GUI2EXE makes and see the differences.
Here is the final spec file I used:
# -*- mode: python -*-
# used MAKESPEC 1.5.1 with --onefile option
# NB must include mpl-data folder under main sofastats level (i.e. sibling of dbe_plugins etc) for matplotlib to work
# manually set level=9 in PYZ params (inspired by how GUI2EXE did it)
# manually replaced name=os.path.join('dist', 'launch.exe'), with name='C:\\sofastats_build_exe\\sofa.main\\sofastats.exe',
# manually set debug=True, upx=False in EXE params
# manually set exclude_binaries=False in EXE params
a = Analysis([os.path.join(HOMEPATH,'support\\_mountzlib.py'), os.path.join(HOMEPATH,'support\\useUnicode.py'), 'C:\\sofastats_build_exe\\sofa.main\\launch.py'],
             pathex=['C:\\Python26\\pyinstaller-1.5.1'])
pyz = PYZ(a.pure, level=9)
exe = EXE(pyz,
          a.scripts,
          a.binaries,
          a.zipfiles,
          a.datas,
          exclude_binaries=False,
          name='C:\\sofastats_build_exe\\sofa.main\\sofastats.exe',
          debug=True,
          strip=False,
          upx=False,
          console=True)
Before going live, switch debug and console to False.
This post is largely specific to SOFA Statistics but hopefully it includes some tips which might save others a lot of fruitless struggle. If you have trouble, I found the pyinstaller mailing list people helpful.
Better installation in non-English environments
November 23rd, 2011
Version 1.1.2 fixes a bug which affected people trying to install SOFA into many non-English environments. It also includes some changes which make it safe for SOFA to communicate progress in more detail when run in Windows using the non-console version of Python. Overall, SOFA has become much more robust in recent versions.
SOFA Statistics and Open Source Business – Misc
November 12th, 2011
The way ahead for SOFA Statistics from a business point of view is not clear at the moment and I recently wrote about some of the issues and options here: Finding a Viable Open Source Business Model – The SOFA Statistics Experience (so far). This post is a complement to that article and the purpose is to let me store miscellaneous ideas and links of relevance without having to integrate them into a coherent narrative.
Impact of cost of downtime on value of support
In my case we have three steel mills worth $10k+ per hour of downtime… Even more if downtime causes rework. If we have more than an hour down I have vice presidents in my bosses office!
http://linux.slashdot.org/comments.pl?sid=2500906&cid=37888734
App Stores not a silver bullet
See Striking It Rich In The App Store: For Developers, It’s More Casino Than Gold Mine
Hard-nosed Realities
So the HP guy comes up to me (at the Melbourne conference) and he says, ‘If you say nasty things like that to vendors you’re not going to get anything’. I said ‘no, in eight years of saying nothing, we’ve got nothing, and I’m going to start saying nasty things, in the hope that some of these vendors will start giving me money so I’ll shut up’. [Quote supposedly from Theo de Raadt - verify]
Impact of different usage patterns
The conventional wisdom on how a business model works is sometimes completely wrong in a particular case. The freemium model, for example, which I am hoping to use with SOFA, was apparently not going to work for Evernote – except it did (Evernote: Company of the Year):
Evernote was being pitched as a so-called freemium service. In other words, people could either use it for free or upgrade to a paid premium version, which is how the company would make money. So far, so good; the freemium model was seen as a smart one. The problem was that, unlike virtually all other entrepreneurs relying on that model, Libin refused to cripple the free version, removing the incentive to upgrade to the paid version. You could pay $5 a month and get additional file storage, but why would anyone do that? asked the VCs. The free version was full featured and offered generous storage.
Libin explained his theory: The more stuff you put in Evernote, the more important the service would be to you. Who would begrudge $5 a month to a company that was storing your memories and helping you retrieve them? “Your notes, your restaurants, your friends, a year of your life, then years of your life,” says Libin. “That’s worth thousands.” The danger wasn’t that people wouldn’t upgrade, he argued; it was that they wouldn’t try the service in the first place or wouldn’t stick with it because the free version was skimpy and failed to impress. Get them to fall in love with the service, and they would eventually pay, because they would be invested in its success. “I want to build a 100-year company, and I’m serious about that,” says Libin. “I don’t need to squeeze money out of you. I’ll have the rest of your life to take your money. It’s my long-term greedy strategy. Our slogan is, ‘We’d rather you stay than pay.’ Basically, I wanted a business model that rhymed.”
And …
Libin showed the group that the rate at which Evernote users were upgrading to the paid version within a month of signing up was half a percent. This was not good—and not surprising, given that the free version worked fine. But then Libin showed the upgrade rates over longer periods of time. Normally, this would be an even grimmer picture, because at almost all companies with freemium models, users who upgrade tend to do so pretty quickly. They sample the hobbled free version, and if they like it, they upgrade right away to get all the features; if they don’t like it enough to upgrade, they tend to abandon the service altogether or use it lightly. But Libin showed that Evernote users became more likely to upgrade over time. For those users who had been using Evernote for a year, the upgrade rate was an impressive 8 percent. If Evernote could get to a million users, explained Libin, sales would be close to $4 million a year. And, at the current growth rate, Evernote would reach 10 million users within two years.
Then Libin showed activity rates, or, roughly, how often an average user was actually using Evernote over time. For many software companies, that curve runs relentlessly downward. Most people who try an app abandon it pretty quickly or use it less frequently as time goes on. But for Evernote, the curve was a smile. There was a slight drop-off in usage after the first few months, but then it went up again—not only because active users were finding the service more and more useful, but also because customers who had stopped using the service were returning to it. People who left Evernote missed it.
Importance of finding your own business model which works for you
NB the value you add may be around the software, rather than the software itself. The keynote also notes that open core is very similar to proprietary models.
FOSS4G 2011 Keynote
Visits-Downloads-Sales Process
One has to be careful about drawing conclusions from a relatively small and unverifiable data set. However the results certainly seem to support the much-quoted “industry standard” sales:visits conversion ratio of 1%. But there are huge variations between products.
The fact that the sales:downloads ratio is both lower on average and more variable than the downloads:visitors ratio implies that getting people to download is the easy bit and converting the download to a sale is a tougher challenge.
The average sales:visits conversion ratio is noticeably higher for Mac OS X products than Windows products. This is supported by anecdotal evidence and the author’s own experience with a cross-platform product. However the number of Mac respondents to the survey is too small for the result to be stated with any great confidence. Also remember that the Mac market is still a lot smaller than the Windows market before you rush off to start learning Cocoa and Objective-C.
The truth about conversion ratios for downloadable software
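The three ratios in that article are linked: sales:visits is simply downloads:visits multiplied by sales:downloads. Here is a small sketch with made-up numbers (not SOFA's figures, and not data from the survey), written for Python 3 so that `/` is true division.

```python
def funnel_ratios(visits, downloads, sales):
    """Return the three conversion ratios for a visit -> download -> sale funnel."""
    downloads_per_visit = downloads / visits
    sales_per_download = sales / downloads
    sales_per_visit = sales / visits  # equals the product of the two ratios above
    return downloads_per_visit, sales_per_download, sales_per_visit


# Hypothetical month: 10,000 visits, 2,000 downloads, 100 sales
d_v, s_d, s_v = funnel_ratios(10000, 2000, 100)
print("downloads:visits {:.1%}".format(d_v))
print("sales:downloads  {:.1%}".format(s_d))
print("sales:visits     {:.1%}".format(s_v))
```

Tracking the two intermediate ratios separately shows which stage of the funnel is the weak one, which a single sales:visits figure cannot.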
Desktop Apps – Harder Sales Funnel?
Someone visits your website, downloads your trial, and hopefully purchases your program. That process is called a funnel, and if you break it down into concrete steps, the shareware funnel is long and arduous for the consumer:
- Start your web session on Google, like everyone does these days.
- Google your pain point.
- Click on the search result to the shareware site.
- Read a little, realize they have software that solves your problem.
- Mentally evaluate whether the software works on your system.
- Click on the download button.
- Wait while it downloads.
- Close your browser.
- Try to find the file on your hard disk.
- Execute the installer.
- Click through six screens that no one in the history of man has ever read.
- Execute the program.
- Get dumped at the main screen.
- Play around, fall in love.
- Potentially weeks pass.
- Find your way back to the shareware site. Check out price.
- Type in your credit card details. Hit Checkout.
I could go into more detail if I wanted, but that is seventeen different opportunities for the shareware developer to fail.
Tracking Usage
On the Internet, privacy expectations have evolved a bit in the last few years. The overwhelming majority of the public has been told that they’re being tracked via cookies and could not care less. If you write a privacy policy, they won’t even bother reading it. Which means that you can disclose in your privacy policy that you track non-personally identifying information, which is very valuable as a software developer.
- What features of your software are being used?
- What features of your software are being ignored?
- What features are used by people who go on to pay?
- What combination of settings is most common?
- What separates the power users from the one-try-and-quit users?
Tracking all of these is very possible with modern analytics software like, e.g., Mixpanel. You can even wrestle the information out of Google Analytics if you’re prepared to do some extra work. You can do it in a way which respects your users’ privacy while still maximizing your ability to give them what they want.
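To make the questions above concrete, here is a rough sketch of the kind of counting involved. It is not how SOFA works (SOFA does not do this) and it does not use the Mixpanel or Google Analytics APIs; the `UsageTracker` class and the feature names are invented for illustration, and only anonymous feature names are stored.

```python
from collections import Counter


class UsageTracker(object):
    """Count feature use per anonymous session; stores no personal details."""

    def __init__(self):
        self.feature_counts = Counter()
        self.paying_feature_counts = Counter()

    def record_session(self, features_used, went_on_to_pay):
        used = set(features_used)  # count each feature once per session
        self.feature_counts.update(used)
        if went_on_to_pay:
            self.paying_feature_counts.update(used)

    def ignored(self, all_features):
        """Features that no recorded session has used."""
        return sorted(set(all_features) - set(self.feature_counts))


tracker = UsageTracker()
tracker.record_session(["charts", "report_tables"], went_on_to_pay=True)
tracker.record_session(["charts"], went_on_to_pay=False)
print(tracker.feature_counts.most_common())
print(tracker.ignored(["charts", "report_tables", "anova"]))
```

Comparing `paying_feature_counts` with `feature_counts` is one simple way to see which features are associated with people who go on to pay.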
The Risk of Starting a Business/Chasing a Dream
Good news for Mac & Linux users – Excel importing added
October 9th, 2011
SOFA Statistics 1.1.1 brings good news for Mac and Linux users. You can now import Excel xls files directly. This is no longer a Windows-only feature.
Here is the full list of changes:
- Excel can be imported from Mac and Linux as well as Windows.
- ODS importing now copes with single ‘divider’ columns – i.e. columns with no field name in the header.
- CSV importing now autofills blank columns with field numbers such as Var018.
- More informative if locale issues.
- More informative if unable to connect to MySQL on Mac.
- Changed standard deviation in report tables from population sd to sample sd.
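The difference behind the last item in that list is the divisor: the population sd divides the sum of squared deviations by n, the sample sd by n - 1. A quick illustration using Python's standard `statistics` module (available in Python 3; this is not SOFA's own code, and the data is made up):

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

population_sd = statistics.pstdev(values)  # divides by n
sample_sd = statistics.stdev(values)       # divides by n - 1

print(population_sd, sample_sd)
```

The sample sd is always the larger of the two, and the gap matters most for small samples.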
There is one important set of bug fixes which allows more sophisticated extraction of cell values from ODS spreadsheets. SOFA now copes with formatted content of cells and other complex cases by handling subelements in the XML.
Version 1.1.0 brings it together
August 20th, 2011
Version 1.1.0 finally brings it together, adding some of the last features to round out the original vision for the application. The main change is much easier access to data – users can now open data tables from anywhere the table can be selected e.g. charts, report tables, statistical analyses.
Another change makes it easier to import from spreadsheets – SOFA now gives a preview of the first few rows of data to make it easier to determine whether there is a header row or not.
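The idea of previewing rows to spot a header can be sketched for the simpler case of CSV text. This is an illustration only, not SOFA's import code: `preview_rows`, `is_number` and `looks_like_header` are made-up names, and the header guess is a deliberately naive heuristic. It is written for Python 3.

```python
import csv
import io


def preview_rows(csv_text, n=5):
    """Return the first n rows of CSV text as lists of strings."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row for _, row in zip(range(n), reader)]


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def looks_like_header(rows):
    """Naive guess: the first row has no numbers but a later row does."""
    if len(rows) < 2:
        return False
    first_has_number = any(is_number(cell) for cell in rows[0])
    later_has_number = any(is_number(cell) for row in rows[1:] for cell in row)
    return later_has_number and not first_has_number


rows = preview_rows("name,age\nAnn,34\nBob,27\n", n=2)
print(rows, looks_like_header(rows))
```

A guess like this can only ever be a default, which is why showing the rows to the user and letting them decide is the safer design.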
The two extra changes are: Importing from Google Doc spreadsheets now automatically starts import process if downloading was successful; Windows users can install into any folder now, not just one with sofastats in the name.
There are also a couple of bug fixes: Fixed bug when trying to display feedback on resizing operation on data table from dialogs other than data select; and fixed regression when running data list report tables.
Here are all the major feature changes since version 1.0 was released:
- Single line charts now have option of a trend line and data smoothing (weighted rolling average).
- Averages can be displayed for most chart types e.g. a line chart of average income by month.
- Attractive and dynamic Box and Whisker plots added.
- Much easier access to data – can now open data table from anywhere the table can be selected e.g. charts, report tables, stats analyses.
- Numerous usability improvements and bug fixes.
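For anyone curious what a trend line and a weighted rolling average involve, here is a minimal sketch. It is not SOFA's charting code: the function names, the (1, 2, 1) weights and the handling of the ends of the series are all assumptions made for this example, and the weights are assumed to be of odd length.

```python
def weighted_rolling_average(values, weights=(1, 2, 1)):
    """Smooth values with a centred weighted window; the ends use the weights that fit."""
    half = len(weights) // 2
    smoothed = []
    for i in range(len(values)):
        total = 0.0
        weight_sum = 0.0
        for offset, weight in enumerate(weights, start=-half):
            j = i + offset
            if 0 <= j < len(values):
                total += weight * values[j]
                weight_sum += weight
        smoothed.append(total / weight_sum)
    return smoothed


def trend_line(values):
    """Least-squares slope and intercept for values at x = 0, 1, 2, ... (needs 2+ points)."""
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / float(n)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


print(weighted_rolling_average([3, 8, 4, 9, 5]))
print(trend_line([1, 3, 5, 7]))
```

Smoothing reduces point-to-point noise while the trend line summarises the overall direction, so the two options complement each other on a single line chart.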
I hope you really like it.
1.0.7 Much easier data entry; better support for non-English text
July 28th, 2011
It is now a lot easier and more pleasant to enter data directly into SOFA. Check it out and see if you agree. It is also easier to get CSV data in if there are lots of fields. Overall this is an incremental step forwards rather than the introduction of lots of new features. Here is the full list of improvements:
- Much easier and quicker data entry. Return key now functions like Tab in data entry tables. Deleting a cell automatically inserts the appropriate value.
- Much faster importing of csv files with lots of fields. Now has option of quickly checking field names collectively (in batches under the surface) rather than individually.
- Improved feedback to user if problem in early stages starting SOFA. Program now makes an error text file on the user desktop as well.
- All field or table name checks in SQLite now return the SQLite error text as well.
- Better message to user if installation of wx backend for matplotlib missing.
- If cancel process of changing file used to define variable config, report table display no longer reverts to random demo.
and bug fixes:
- Fixed bug in chi square when no labels set for numerical variables. Needed to convert value to unicode before using as label.
- Fixed bug when importing datetimes with ‘T’ as the separator between date and time.
- Fixed bug caused by SQLite queries sometimes returning strings instead of floats when extracting REAL (numeric) data. Fixed it where it affected Row Stats medians and std devs; and statistical tests.
- Fixed bug when uwhisker and lwhisker not set. Also copes better when no boxes are displayed in boxplot.
- Handling Python 2.6 unicode keyword bug.
- Replaced pprint.pformat where it messes up unicode e.g. user paths with non-ascii characters. Misc other changes to fix internal issues.
- Fixed bug allowing None to be displayed in Val A and Val B drop-downs under Group by e.g. ANOVA.
- Config dialog in Report Tables widened slightly when needed to display title.
- Fixed bug when decimal entered into value label list for an integer field.
- Fixed CSV import bug when trying to guess whether a header or not.
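As an example of the sort of fix involved, datetimes with a ‘T’ separator can be handled by normalising the separator before parsing. This is a simplified sketch, not the code SOFA uses: `parse_datetime` is a made-up name and it only covers one date format.

```python
from datetime import datetime


def parse_datetime(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' with either a space or a 'T' between date and time."""
    normalised = text.strip().replace("T", " ", 1)
    return datetime.strptime(normalised, "%Y-%m-%d %H:%M:%S")


print(parse_datetime("2011-07-28T14:30:00"))
print(parse_datetime("2011-07-28 14:30:00"))
```

Both inputs give the same datetime value, so the rest of an import routine does not need to know which separator the source file used.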
Thanks to all the users who helped identify and resolve problems.






