We are trying to identify problems or gaps in SOFA Statistics which make it unsuitable for some users. If you don't mind our asking, what were you wanting from SOFA that was lacking? Or was something else a better fit? Please don't spare our feelings - we really need to get this right.
Sorry - not at this stage, but there could well be something like that in the future. There are other open source options you may wish to consider in the meantime. See http://www.openstatistics.com and http://en.wikipedia.org/wiki/List_of_statistical_packages#Open_source.
The longer-term plan is for SOFA Statistics to support plug-ins for a multitude of extra tests. SOFA will supply the data, the variable and value labels, the data type information, and the reports in which to display the output. The plug-in will only have to comply with a simple interface for both input and output. If the plug-in approach attracts developers and statisticians, SOFA should be able to support what you are looking for. The emphasis will remain, however, on making it easy for people to use the most common tests without being overwhelmed.
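To make the plug-in idea concrete, a hypothetical interface might look something like the following. All names here are illustrative only; no such SOFA API exists yet.

```python
class StatsPlugin:
    """Hypothetical shape of a SOFA test plug-in.

    SOFA would supply the data rows plus the variable/value label and
    data type metadata; the plug-in returns HTML for the report."""
    name = "My extra test"

    def run(self, rows, var_labels, val_labels, var_types):
        raise NotImplementedError


class MeanPlugin(StatsPlugin):
    """Toy plug-in: reports the mean of the first variable."""
    name = "Mean"

    def run(self, rows, var_labels, val_labels, var_types):
        vals = [row[0] for row in rows]
        mean = sum(vals) / len(vals)
        return "<p>Mean of %s: %s</p>" % (var_labels[0], mean)


html = MeanPlugin().run([(1,), (2,), (3,)], ["age"], [{}], ["numeric"])
print(html)  # <p>Mean of age: 2.0</p>
```

The point of the simple contract is that a plug-in author only writes the statistics; data access, labelling, and report placement stay on SOFA's side.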
“I think that the biggest mistake people make is latching onto the first idea that comes to them and trying to do that. It really comes to a thing that my folks taught me about money. Don’t buy something unless you’ve wanted it three times. Similarly, don’t throw in a feature when you first think of it. Think if there’s a way to generalize it, think if it should be generalized. Sometimes you can generalize things too much. I think like the things in Scheme were generalized too much. There is a level of abstraction beyond which people don’t want to go. Take a good look at what you want to do, and try to come up with the long-term lazy way, not the short-term lazy way.” — Larry Wall, creator of the Perl programming language
Hi Nick,
I thought I would send a bit of encouragement on the concept. I can see the point of providing a consistent interface that works irrespective of whether the datastore is a database or a spreadsheet, or whatever. I wouldn't use it for Django, but it would have a home in data analysis projects.
We sometimes have to deal with the tedious problem of sucking data out of spreadsheets and into either R or python. As an example of an only tangentially related project that aims to go in the other direction (in this case from spreadsheets into R), Jimmy Oh has been working on the prosaically named TableToLongForm:
https://www.stat.auckland.ac.nz/~joh024/Research/TableToLongForm/TableToLongForm_A1Poster.pdf
https://github.com/joh024/TableToLongForm
Ed (Dragonfly)
From cesar.rabak@gmail.com
A fact of life is that analysing data tends to reveal information which, when seen in aggregated/summarised/analytical form, prompts questioning from the supplier of the data, and that questioning may lead to corrections of the data as initially input.
In fact, most methodologies have a “data validation” or “preliminary review” step where an opportunity to address this is offered to the client.¹
The most advanced solution for this has been developed by the R community, which addresses it as “reproducible research”: you create a document whose statistical-analysis parts are linked in such a fashion that running some form of automatic procedure regenerates the report, so if the data changes, the changes propagate through the report, updating and correcting it.
Obviously this reduces the chance of introducing errors, such as omitting corrected tables/graphs, and it also eases versioning of the document (the deliverable).
For an overview from the R point of view, please see http://cran.r-project.org/web/views/ReproducibleResearch.html; see also the “related links” there for more ideas on this.
In a nutshell, most of these solutions provide a means of introducing special chunks of information between a specific set of marks. The word processing tool initially treats these chunks as a form of comment or macro; a post-processing step then replaces them with the results of the analyses coded in them, creating the deliverable.
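As a toy illustration of the chunk idea only (real tools such as Sweave or pyreport use their own chunk syntax and a proper evaluator; the `<<…>>` marks below are invented for this sketch), such a post-processor can fit in a few lines of Python:

```python
import re

# Invented chunk marks for illustration: <<expression>>
CHUNK_RE = re.compile(r"<<(.+?)>>")

def render(template, data):
    """Replace each <<expression>> chunk with the result of evaluating
    it against the data. Re-running after a data correction regenerates
    the whole deliverable consistently."""
    return CHUNK_RE.sub(
        lambda m: str(eval(m.group(1), {"data": data})), template)

report = "N = <<len(data)>>; mean = <<sum(data) / len(data)>>."
print(render(report, [2, 4, 6]))  # N = 3; mean = 4.0.
```

If the data later changes, rendering the same template again updates every figure in the report at once, which is the whole point of the approach.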
Now my opinion, wearing my dev hat: since SOFA is Python based and there are Python bindings for Microsoft Office and Open/LibreOffice applications, I do believe a similar solution could be developed to allow SOFA to produce similar results.
Ability to run analyses against a random sample of a large dataset e.g. 5% or some arbitrary percentage. Make it a filter option.
The logical place to access it would probably be the filter dialog. Implementing it would take a bit of thought because of the likely massive performance differences between strategies. Making some sort of temporary table seems a potentially good strategy, possibly in the database we are working with, e.g. MySQL#. It should also be possible to combine filters, e.g. I might want to look at gender = 1 for a 5% sample. It would make sense to apply the sampling filter first (a one-off temporary table-making process) and then apply any filters for specific analyses on top of that, just as we would with a normal, unsampled table.
# Making a temporary table in SQLite could also work (see http://www.sqlite.org/tempfiles.html) but SOFA would need to reset the dbe, con, cur etc. information whenever that table was selected. It could be trivial, but it would pay to be careful.
http://www.carlj.ca/2007/12/16/selecting-random-records-with-sql/
My instinct is towards letting the SQL database engine do the heavy lifting for which it is optimised.
SELECT Field1, ..., FieldN FROM Table1 ORDER BY Random() LIMIT 1
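The temporary-table strategy discussed above can be demonstrated with SQLite. The table and function names below are illustrative only, not actual SOFA internals:

```python
import sqlite3

def make_sample_table(con, src_tbl, pct):
    """Create a temporary table holding a random pct% sample of src_tbl,
    so later per-analysis filters can run against it just like a normal
    table. The one-off cost is paid here, not per analysis."""
    cur = con.cursor()
    cur.execute("SELECT COUNT(*) FROM %s" % src_tbl)
    n_sample = round(cur.fetchone()[0] * pct / 100.0)
    cur.execute("CREATE TEMPORARY TABLE sample_tbl AS "
                "SELECT * FROM %s ORDER BY Random() LIMIT %s"
                % (src_tbl, n_sample))
    return "sample_tbl"

# Usage sketch with a throwaway in-memory database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (id INTEGER, gender INTEGER)")
con.executemany("INSERT INTO demo VALUES (?, ?)",
                [(i, i % 2) for i in range(1000)])
tbl = make_sample_table(con, "demo", 5)
n_sampled = con.execute("SELECT COUNT(*) FROM %s" % tbl).fetchone()[0]
print(n_sampled)  # 50
```

A further filter such as `WHERE gender = 1` could then be applied to `sample_tbl` for each specific analysis.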
It's called a “power” trendline and the equation is: y = 1E-10 × x^23.83
The R² value (whatever that is) is 0.866.
You can choose Exponential, Linear, Logarithmic, Polynomial or Moving Average trendlines but I chose “power” purely because it gave the most pleasing curve…
Perhaps add a button which says “use values from a selected variable”. It shows the user the labels and they can apply them, then either modify and save or just save.
Here is a scenario (please feel free to make your own - I'm a bit rusty on this):
Favourite cars - can choose multiple. Each question is stored as a separate variable.
Three voters: Marco, Frida & Henry
OK so what we want is to be able to provide a short label e.g. “Car”, a long label e.g. “Favourite Sports Car”, a unique name for the multiple response configuration e.g. “favcar1”, and a table with N and % for each car, total N, and a comment.
Favourite Sports Car
Car            N  %
Ferrari        2  25.0
Porsche        1  12.5
Lamborghini    1  12.5
McLaren        3  37.5
Aston Martin   1  12.5
Total          8  +
+ Percentages will add up to greater than 100% when there are multiple responses
The configuration would be stored in the appropriate project vdt (variable details) file. Multiple response configurations might look like this behind the scenes (in a plain text file written in Python):
multresp = {
    u"favcar1": {
        u"shortname": u"Car",
        u"longname": u"Favourite Sports Car",
        u"vars": [u"q1a", u"q1b", u"q1c", u"q1d", u"q1e", u"q1f"],
    },
    u"favpizza": {... etc},
}
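A toy sketch (not SOFA code; the function name and sample rows are invented) of how such a configuration plus raw data rows could yield the N and % figures in the table above:

```python
multresp = {
    "favcar1": {
        "shortname": "Car",
        "longname": "Favourite Sports Car",
        "vars": ["q1a", "q1b", "q1c"],
    },
}

def multresp_freqs(config_name, rows):
    """Count each response across all the variables in the set.
    Percentages are of total responses, so per respondent they can
    add up to more than 100% when there are multiple responses."""
    cfg = multresp[config_name]
    counts = {}
    total = 0
    for row in rows:
        for var in cfg["vars"]:
            val = row.get(var)
            if val is not None:
                counts[val] = counts.get(val, 0) + 1
                total += 1
    return {val: (n, 100.0 * n / total) for val, n in counts.items()}, total

rows = [
    {"q1a": "Ferrari", "q1b": "McLaren", "q1c": None},
    {"q1a": "McLaren", "q1b": None, "q1c": None},
    {"q1a": "Ferrari", "q1b": "Porsche", "q1c": "McLaren"},
]
freqs, total = multresp_freqs("favcar1", rows)
print(total)             # 6
print(freqs["McLaren"])  # (3, 50.0)
```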
It would also be nice to be able to do bar charts of multiple response sets - of either N or %.
e.g. (the real one would be much better looking of course!):
Favourite Sports Car

Ferrari      x x
Porsche      x
Lamborghini  x
McLaren      x x x
Aston Martin x
I think the easiest way to give access to multiresp sets would be in the drop-down lists of variables. Now variables are table-specific - perhaps multiresp sets should be as well. There would be no point having the option of selecting “favcar1 (multiple responses)” if the current table didn't have the relevant variables in it.
Hmmmmm - I wonder if I should leave variable details project-wide or make them table-specific as well. There are pros and cons both ways. There may also be ways of indicating whether a variable detail is project-wide or table-specific (default).
gabrielsalas@gmail.com (Gabo, Gabriel Salas)
It's more common to be faced with a multiple response set where you don't know in advance what the answers are going to be and where, besides that, there is a co-dependency with a previous question. This is what I mean:
Keeping up with the cars example:
The first scenario is 'CARS OWNED' with a preliminary list of 65 brands, so you cannot preset them as individual questions. Assuming a maximum of 5 mentions per informant, you'll still have:
But now we can't say that q01a will always have 'FERRARI' as the response; actually, q01a could have any of the 65 preliminary brands, or even new brands, so it's necessary to be able to add a 'vals' list as well…
(I don't know any Python, but I think this would be the idea…)
multresp = {
    u"favcar1": {
        u"shortname": u"Car",
        u"longname": u"Cars owned",
        u"vars": [u"q01a", u"q01b", u"q01c", u"q01d", u"q01e"],
        u"vals": {1: u"Ferrari", 2: u"Porsche", 3: u"Lamborghini"},
    },
    u"favpizza": {... etc},
}
And now, in this case, I can handle all the mentions that I need.
For this scenario, using the same info you provided, I will have the exact same results:
Three voters: Marco, Frida & Henry
(For all cases, q01d and q01e are empty)
CAR OWNED
Car            N  %
Ferrari        2  25.0
Porsche        1  12.5
Lamborghini    1  12.5
McLaren        3  37.5
Aston Martin   1  12.5
Total          8  *
* Percentages will add up to greater than 100% when there are multiple responses
There is also an even more common scenario:
'PREFERRED CAR' - single selection - q02
'REASONS OF PREFERENCE' - multiple selection - q03a thru q03e
Once again, in q03a thru q03e, I'm going to have a lot of different answers, and in this case, I can't even think about a preset list… so, it's a completely open question.
So, you will have
REASONS OF PREFERENCE
Car                    N  %
Ferrari   Beauty       2  100
          Fashion      1  50
          Speed        2  100
          Style        1  50
          Total        2  *
McLaren   Performance  1  100
          Speed        1  100
          Tradition    1  100
          Total        1  *
* Percentages will add up to greater than 100% when there are multiple responses
Basically, both scenarios depend on being able to add 'Value Labels' to each question so, there's no dependency between position and value.
With this last scenario, the chart will also change from the common one: instead of making a single 'TOTAL REASONS' chart, it should nest the results the same way they are nested in the tables, producing 2 charts for the same question.
Now, talking about the way to give access: I think a drop-down list would be OK, but with multiple selection enabled as well, so if the multresp set you are going to create has 15 variables, you can click on all 15 without having to open the dialog again. Also, out of personal experience (jejeje), there should be the chance of removing one or more vars from the list at the same time, again without reopening the dialog; when adding multiple vars to a list, it often happens that you end up picking more than needed.
I really think that 'variable details' should be table-specific, because even though I always have a variable called AGE(group) in all my projects, the groups per se are not the same in any of them. Even 'GENDER': sometimes it's 1: u'MALE', 2: u'FEMALE' and sometimes it's the other way around. So I think there should at least be an option where you can decide between table-specific and global. (For me, global is not an option ;P)
Original user suggestion: copying and pasting from a spreadsheet into SOFA. For example, it is very convenient to copy data from a spreadsheet you've created in software like Excel, because spreadsheets are very good for data arrangement and input. Simple as that: copy a column and paste it into SOFA. That way it is possible to create a table in a few clicks by combining the power of data listing in, for example, Excel with the statistical calculations in SOFA.
Building blocks for implementing copy and pasting:
The following looked interesting too:
http://ginstrom.com/scribbles/2008/09/07/getting-the-selected-cells-f...
http://wiki.wxpython.org/wxGrid#Selection_Set_Management
From: http://www.blog.pythonlibrary.org/2010/03/18/wxpython-an-introduction...
And this project: https://sourceforge.net/projects/pywxgridedit/. From the site, it “Enhances wx.grid to provide functionality for cut/copy/paste and undo/redo. Implemented as mixin.”
Here is the most recent code (about 6 months old): http://pywxgridedit.svn.sourceforge.net/viewvc/pywxgridedit/trunk/pyw...
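Whichever widget code is used, the parsing side is simple, because spreadsheets put copied cells on the clipboard as tab- and newline-separated text. A minimal sketch (the function name is made up; in the wxPython GUI the text itself would come from wx.TheClipboard via a wx.TextDataObject):

```python
def clipboard_text_to_rows(text):
    """Turn spreadsheet clipboard text into a list of row lists.

    Spreadsheets such as Excel and Calc place copied cells on the
    clipboard as plain text with columns separated by tabs and rows by
    newlines, so splitting on those recovers the grid of cell values."""
    text = text.replace("\r\n", "\n").rstrip("\n")
    return [row.split("\t") for row in text.split("\n")]

print(clipboard_text_to_rows("name\tage\r\nFrida\t21\r\nMarco\t35"))
# [['name', 'age'], ['Frida', '21'], ['Marco', '35']]
```

The resulting rows could then be written into the selected region of the wx.grid.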
When running a frequency table, have the option of running frequencies on auto-generated bins.
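A minimal sketch of equal-width auto-binning (illustrative only; a real implementation might prefer "nice" round bin boundaries and different labels):

```python
def auto_bin_freqs(values, n_bins=3):
    """Frequency counts over n_bins equal-width bins spanning the data.
    The top value is counted into the last bin so nothing falls off
    the end of the range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1  # avoid zero width if all values equal
    freqs = {}
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        lower = lo + idx * width
        label = "%g to < %g" % (lower, lower + width)
        freqs[label] = freqs.get(label, 0) + 1
    return freqs

print(auto_bin_freqs([1, 2, 2, 3, 10]))  # {'1 to < 4': 4, '7 to < 10': 1}
```

Empty bins are simply absent from the result; a frequency table would probably want to show them with a zero count instead.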
A quicken/QIF importer could be a nice feature for the future
http://allmybrain.com/2009/02/04/converting-financial-csv-data-to-ofx-or-qif-import-files/
The best option is probably the one which uses WebKit to render and then turns the (WebKit-guaranteed perfect) result into PDF:
User suggestion (Marek Paweł Karczewski marek_karczewski@yahoo.com.au): Maybe you could add the options “Export charts to separate files” and “Export tables to separate files”, then save each table or chart to a separate file, preferably in PDF format. Take me, for instance: I have written a thesis in LaTeX, using LyX, and needed to show a couple of charts. LyX likes graphics input in PDF format.
Also be aware of alternatives:
Gael Varoquaux (a core developer of Mayavi2) has developed “pyreport”, a very helpful ad-hoc implementation of this idea.
pyreport is a program that runs a python script and captures its output, compiling it to a pretty report in a pdf or an html file. It can display the output embedded in the code that produced it and can process special comments (literate comments) according to markup languages (rst or LaTeX) to compile a very readable document.
It can also embed pretty plots from captured matplotlib output within the compiled document. Really, quite neat:
Hi Kris,
OK - I'll start with 10g.
Re: testing I don't mean anything fancy - just a simple python script something like this:
import cx_Oracle

...connect()

def get_tbls(...):
    ...
    print("Tables: %s" % tbls)

get_tbls()

def get_dbs(...):
    ...
    print("Databases: %s" % dbs)

get_dbs()
At this stage the task is simply to see if I can flesh out the SOFA Statistics database plug-in API.
All the best, Grant
On 27/01/10 04:49, Kris Bravo wrote:
If there are any issues to avoid, you could start with 10g express and the 10g
cx_Oracle, then work out any kinks when switching python libraries.
Other than that, it looks like a straightforward process.
Regarding tests, I haven't looked at your existing test cases much, but
if I find any issues connecting to live data I'd like to give you some
sort of script or test case that you can repeat. Do you have some
thoughts on this? A UI runner, or integration test stack?
Kris
On Jan 24, 8:41 pm, Grant Paton-Simpson<gr…@p-s.co.nz> wrote:
Hi Kris,
Thanks for the feedback. Would the following be a good strategy:
1) I install 10g express edition for testing
2) I install the 11g version of cx_Oracle
3) I start building a module called test_oracle.py which starts fleshing
out the parts behind the SOFA Statistics database plug-in API e.g.
listing available databases, tables, fields etc. You help me test them
(very simple process) as we go.
4) I build a working version of dbe_oracle.py which is automatically
included in the default SOFA Statistics installation packages
5) Testing occurs and then the module is included in the next available
release
Feel free to suggest any changes.
All the best, Grant
Kris Bravo wrote:
Hi Grant,
It is pretty common to connect to databases spanning the different
versions. I checked my local installation and it is the 10g XE install
- You should be pretty safe developing against that. The java drivers
tend to support lower versions so hopefully the python library you
pointed out has the same trait.
Regarding actual installations, the user would only have a single
version installed. They tend to occupy a lot of ports and service
names.
More later when I get a free moment.
Kris Bravo
On Jan 24, 3:29 am, Grant Paton-Simpson<gr…@p-s.co.nz> wrote:
Hi Kris,
Excellent - I will start doing some experimenting. Please forgive me if
I ask a few “dumb” questions along the way about Oracle databases,
typical Oracle database scenarios etc.
Q1: would many users run more than one version of Oracle e.g. 10g and
11g? Is it even possible to do that?
I am hoping not. One option is to let users decide which version of
cx_Oracle to install during the SOFA Statistics installation. I get the
feeling that the different versions of cx_Oracle are not meant to
co-exist but I would love to be proven wrong.
Here is something which suggests you get cx_Oracle for one version only:
Just go to http://www.computronix.com/utilities.shtml and download a
Windows installer or a Linux RPM file matching your versions of Python
and Oracle.
(http://www.ioug.org/python.pdf)
Ideally, the 11g version would handle all versions below it but I
haven't found any documentation to that effect. If it does, I will only
ever have to install one and not bother the user with any decisions. If
not, users will only be able to use the version they have selected at
installation. Unless there is some way of installing multiple cx_Oracle
modules alongside each other and calling them separately. The standard
approach is to:
import cx_Oracle
which doesn't indicate the ability to select version. I will need to do
some experiments on this. And to do that I will need at least one
Oracle database to test on. Which leads to …
Q2: I have found a link to the 10g express edition. But no luck with an
11g version. Is that likely to exist? Would it surprise you as an
Oracle user if there were no 11g express version?
http://www.oracle.com/technology/products/database/xe/index.html
I am probably only going to be able to test against the 10g version
myself and rely on others for feedback on the 9i and 11g versions.
Any feedback/advice etc from Python users with Oracle experience is welcome.
All the best, Grant
Kris Bravo wrote:
Hi Grant,
Our environment is Oracle 11g.
I fetched a copy of the source to catch up on how you have the modules
organised and to help test. We're using a variety of Oracle
deployments here ranging from local XE installs to RACs with failover.
I ought to be able to provide some good feedback.
Kris
On Jan 23, 9:47 pm, Grant Paton-Simpson<gr…@p-s.co.nz> wrote:
Hi Kris,
You're right - I haven't made an Oracle module yet (all available
modules are in the dbe_plugins subfolder). Can I conclude from your
question that you use Oracle yourself? If so, which version(s)? And
would you be able to answer some basic Oracle questions for me
(orientation rather than detailed technical questions). And possibly
help me do some basic testing of a new module? Or would anyone else on
this group be able to or know anyone who could? As a first step I would
try to get some standalone functions working which list tables, list
fields etc. Then I would put it all together in a dbe_oracle.py module
ready to just plug into SOFA Statistics (or possibly, separate
dbe_oracle_9i.py, dbe_oracle_10g.py, and dbe_oracle_11g.py modules if
necessary). Anyway, here are a couple of useful links for future
reference:
page: http://cx-oracle.sourceforge.net/ - has separate installers
by Oracle version.
http://cx-oracle.sourceforge.net/html/index.html
My immediate priority at the moment is to get a Mac installer package
built for SOFA. But I am happy to start the ball rolling on an Oracle
database plug-in if I can get a little bit of advice e.g. which versions
to cover (9i, 10g, 11g).
All the best, Grant
Kris Bravo wrote:
I looked for some indication that Sofa supports connecting to an
Oracle database but couldn't find anything. Is this a feature that's
available/in the works/scoped for the future?
Kris
Workbook
get_sheet_names
getsheet
Worksheet
get_data_rows_n
get_fld_names
iter
Ubuntu has python-xlrd so make it a dependency in the Deb. Add Windows installer for xlrd. What about Mac?
xlrd
Author: John Machin
Licence: BSD
import xlrd

book = xlrd.open_workbook("myfile.xls")
print "The number of worksheets is", book.nsheets
print "Worksheet name(s):", book.sheet_names()
sh = book.sheet_by_index(0)
print sh.name, sh.nrows, sh.ncols
print "Cell D30 is", sh.cell_value(rowx=29, colx=3)
for rx in range(sh.nrows):
    print sh.row(rx)
# Refer to docs for more details.
# Feedback on API is welcomed.
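The method names listed above (Workbook, get_sheet_names, getsheet, Worksheet, get_data_rows_n, get_fld_names, iter) could be fleshed out as a thin wrapper over xlrd along these lines. This is a sketch only: it assumes field names sit in the first row, and a small stub stands in for the xlrd book so it can run without a real .xls file; in practice you would pass in xlrd.open_workbook(path).

```python
class Worksheet:
    """Wrapper exposing the worksheet method names sketched above."""
    def __init__(self, sheet):
        self.sheet = sheet

    def get_data_rows_n(self):
        return self.sheet.nrows - 1  # exclude the header row

    def get_fld_names(self):
        return [str(val) for val in self.sheet.row_values(0)]

    def __iter__(self):
        for rowx in range(1, self.sheet.nrows):
            yield self.sheet.row_values(rowx)


class Workbook:
    def __init__(self, book):  # e.g. book = xlrd.open_workbook("myfile.xls")
        self.book = book

    def get_sheet_names(self):
        return self.book.sheet_names()

    def getsheet(self, name):
        return Worksheet(self.book.sheet_by_name(name))


# Stand-in for an xlrd book, mimicking only the attributes the wrapper
# uses (sheet_names, sheet_by_name, nrows, row_values).
class _StubSheet:
    def __init__(self, rows):
        self.rows, self.nrows = rows, len(rows)
    def row_values(self, rowx):
        return self.rows[rowx]

class _StubBook:
    def __init__(self, sheets):
        self.sheets = sheets
    def sheet_names(self):
        return list(self.sheets)
    def sheet_by_name(self, name):
        return _StubSheet(self.sheets[name])

book = Workbook(_StubBook({"survey": [["id", "age"], [1, 23], [2, 34]]}))
sheet = book.getsheet("survey")
print(book.get_sheet_names())   # ['survey']
print(sheet.get_fld_names())    # ['id', 'age']
print(sheet.get_data_rows_n())  # 2
print(list(sheet))              # [[1, 23], [2, 34]]
```

Keeping the wrapper this thin would make it straightforward to back the same interface with other sources (e.g. CSV) later.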