[Server-devel] [Sugar-devel] The quest for data

Mon Jan 6 03:28:04 EST 2014

On 3.1.2014 04:09, Sameer Verma wrote:
> Happy new year! May 2014 bring good deeds and cheer :-)
> 
> Here's a blog post on the different approaches (that I know of) to data
> gathering across different projects. Do let me know if I missed anything.
> 
> cheers,
> Sameer
> 
> http://www.olpcsf.org/node/204

Thanks for putting together the summary, Sameer. Here is more information about
my xo-stats project:

The project's objective is to determine how XOs are used in Nepalese
classrooms, but I intend the implementation to be general enough that other
deployments can reuse it as well. As in the other projects you've mentioned,
I separated the work into four stages:

1) collecting data from the XO Journal backups on the schoolserver
2) extracting the data from the backups and storing it in an appropriate format
for analysis and visualization
3) statistically analyzing and visualizing the captured data
4) formulating recommendations for improving the program based on the analysis.

Stage 1 is already implemented on both the server side and the client side,
so I focused first on the next step, extracting the data. Initially, I wanted
to reuse an existing script, but I eventually found that none was general
enough to meet my criteria. One of my goals is to make the script work on any
version of Sugar.

Thus, I have been working on process_journal_stats.py, which takes a '/users'
directory with XO Journal backups as input, pulls out the Journal metadata, and
writes it to a CSV or JSON file.
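
To illustrate the output step, here is a minimal sketch. It is not the code of
process_journal_stats.py. It assumes the Journal metadata has already been read
into a list of Python dicts; the function name write_records and the sample
field names and values are hypothetical.

```python
import csv
import io
import json


def write_records(records, out, fmt="csv"):
    """Write a list of metadata dicts to the open text file `out`."""
    if fmt == "json":
        json.dump(records, out, indent=2, sort_keys=True)
        return
    # Records may carry different keys, so use the union of all keys
    # as the CSV header and leave missing values empty.
    fieldnames = sorted({key for rec in records for key in rec})
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)


# Hypothetical sample records, for illustration only.
sample = [
    {"activity": "org.laptop.WebActivity", "title": "Browse"},
    {"activity": "org.laptop.AbiWordActivity", "mime_type": "text/plain"},
]
buf = io.StringIO()
write_records(sample, buf, fmt="csv")
csv_text = buf.getvalue()
```

Taking the union of keys keeps one row per Journal record even when records
from different activities carry different metadata fields.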

Journal backups come in a variety of formats depending on the version of
Sugar. The script currently supports the backup format used in Sugar versions
0.82 - 0.88, since the laptops distributed in Nepal are XO-1s running Sugar
0.82. I am planning to add support for later versions of Sugar in the next
version of the script.

The script currently supports two ways to output statistical data. To produce
all statistical data from the Journal, one row per Journal record:

    process_journal_stats.py all

To extract statistical data about the use of activities on the system, use:

    process_journal_stats.py activity
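
The kind of per-activity aggregation this mode is concerned with could look
roughly like the sketch below. This is an illustration only, not the script's
actual logic: it assumes each Journal record is a dict whose 'activity' key
holds the activity's bundle id, and the function name count_activity_use and
the sample data are hypothetical.

```python
from collections import Counter


def count_activity_use(records):
    """Count Journal records per activity, most used first."""
    counts = Counter(
        rec["activity"] for rec in records if rec.get("activity")
    )
    return counts.most_common()


# Hypothetical sample records, for illustration only.
sample = [
    {"activity": "org.laptop.WebActivity"},
    {"activity": "org.laptop.AbiWordActivity"},
    {"activity": "org.laptop.WebActivity"},
    {"title": "record without an activity"},  # skipped by the filter
]
usage = count_activity_use(sample)
```

Records without an 'activity' value are skipped rather than counted under an
empty name.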

The full documentation, including all the options, is in the README at:

https://github.com/martasd/xo-stats

One challenge of the project has been determining how much data processing to
do in the Python script and what to leave for the data analysis and
visualization tools later in the workflow. For now, I have stopped adding
features to the script and I am evaluating the most appropriate tools to use
for visualizing the data.

Here are some of the questions I am intending to answer with the visualizations
and analysis:

* How many times do installed activities get used? How does activity use
change over time?
* Which activities are children using to create files? What kind of files are
being created?
* Which activities are being launched in share-mode and how often?
* At what time of day do children play with the activities?
* How does the set of activities used evolve as children age?

I am also going to be looking at how the answers to these questions vary from
class to class, school to school, and region to region.
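
As a rough sketch of how the time-of-day question might be approached once the
data is extracted: the code below assumes each record carries a 'timestamp'
field in Unix epoch seconds, which is an assumption about the extracted data.
The function name records_by_hour is hypothetical, and the fixed UTC+05:45
offset stands in for Nepal Time.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Fixed offset for Nepal Time (UTC+05:45).
NEPAL_TIME = timezone(timedelta(hours=5, minutes=45))


def records_by_hour(records, tz=NEPAL_TIME):
    """Count Journal records per local hour of day (0-23)."""
    hours = Counter()
    for rec in records:
        ts = rec.get("timestamp")
        if ts is None:
            continue  # skip records without a usable timestamp
        hours[datetime.fromtimestamp(float(ts), tz).hour] += 1
    return dict(sorted(hours.items()))


# Hypothetical timestamps (epoch seconds), for illustration only.
sample = [{"timestamp": 0}, {"timestamp": 900}, {"timestamp": 3600}, {}]
by_hour = records_by_hour(sample)
```

Grouping the input by class, school, or region before calling such a function
would give the per-group comparison described above.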

As Martin Abente and Sameer mentioned above, our work needs to be informed by
discussions with the stakeholders: children, educators, parents, school
administrators, etc. We do have educational experts among the staff at OLE who
have worked with more than 50 schools altogether, and I will be talking to them
as I look beyond answering the obvious questions.

For visualization, I have explored using LibreOffice and SOFA, but neither was
flexible enough to allow customization of the output beyond a few rudimentary
options, so I started looking at various JavaScript libraries, which are much
more powerful. Currently, I am experimenting with Google Charts, which I found
the easiest to get started with. If I run into limitations with Google Charts
in the future, others on my list are InfoVIS Toolkit
(http://philogb.github.io/jit) and HighCharts (http://highcharts.com). Then,
there is also D3.js, but that's a bigger animal.

Alternatively or perhaps in parallel, I am also willing to join efforts to
improve the OLPC Dashboard, which is trying to answer very similar questions to
mine.

I am looking forward to collaborating with everyone who is interested in
exploring ways to analyze and visualize OLPC/Sugar data in an interesting and
meaningful way.

Cheers,
Martin