System update spec proposal

Ivan Krstić krstic at solarsail.hcs.harvard.edu
Tue Jun 26 13:55:36 EDT 2007


Software updates on the One Laptop per Child's XO laptop
========================================================




0. Problem statement and scope
==============================

This document aims to specify the mechanism for updating software on the
XO-1 laptop. When we talk about updating software, we are referring both
to system software, such as the OS and the core services controlled by
OLPC that are required for the laptop's basic operation, and to any
installed user-facing applications ("activities"), both those provided
by OLPC and those provided by third parties.




1. System updater
=================

1.1. Core goals
---------------

The three core goals of a software update tool (hereafter "updater")
for the XO are as follows:

     * Security
     Given the initial age group of our users, it is the only reasonable
     solution to default to automatic detection and installation of
     updates, both to be able to apply security patches in a timely
     fashion, and to enable users to benefit from rapid development and
     improvements in the software they're using. Automatic updates,
     however, are a security issue unto themselves: compromising the
     update system in any way can provide an attacker with the ability
     to wreak havoc across entire installed bases of laptops while
     bypassing -- by design -- all the security measures on the machine.
     Therefore, the security of the updater is paramount and must be its
     first design goal.

     * Uncompromising emphasis on fault-tolerance
     Given the scale of our deployment, the relatively high complexity
     of our network stack when compared to currently-common deployments,
     the unreliability of Internet connectivity even when available, and
     perhaps most importantly our desire for participating countries to
     soon begin customizing the official OLPC OS images to best suit
     them, it is clear that our updater must be fault-tolerant. This is
     both in the simple sense -- cryptographic checksums need to be used
     to ensure updates were received correctly -- and in the more
     complex sense that the likelihood of a human error with regard to
     update preparation goes up proportionally to the number of
     different base OS images at play. A fault-tolerant updater will
     therefore allow _unconditional_ rollback of the most recently
     applied update. "Unconditional" here means that, barring the
     failure of other parts of the system which are dependencies of the
     updater (e.g. the filesystem), the updater must always know how to
     correctly unapply an applied update, even if the update was
     malformed.

     * Low bandwidth
     For much the same reasons (project scale, Internet access scarcity
     and unreliability) that require fault-tolerance from the updater,
     the tool must take maximum care to minimize data transfer
     requirements. This means, concretely, that a delta-based approach
     must be utilized by the updater, with a "keyframe" or "heavy"
     update being strictly a fallback in the unlikely case an update
     path cannot be constructed from the available or reachable delta
     sets.



1.2. Design
-----------

It is given, due to requirements imposed by the Bitfrost security
platform, that a laptop will attempt to make daily contact with the
OLPC anti-theft servers. During that interaction, the laptop will post
its system software version, and the response provided by the
anti-theft service will optionally contain a relative URL of a more
recent OS image.

If such a pointer has been received and the laptop is behind a known
school server, it will probe the school server via rsync at the provided
relative URL to determine whether the server has cached the update
locally. If the update is not available locally, the laptop will wait up
to 24 hours, checking approximately hourly whether the school server has
obtained the update. If at the end of this wait period the school server
still does not have a local copy of the update, it is assumed to be
malfunctioning, and the laptop will contact an upstream master server
directly by using the URL provided originally by the anti-theft service.

In any of these three cases (school server has update immediately,
school server has update after delay, upstream master has update), we
say the laptop has 'found an update source'.
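
As a rough illustration of the three cases above, here is a minimal
Python sketch of the search for an update source. The function name,
the has_update probe callback, the upstream_base parameter and the
rsync:// URL layout are all assumptions made for illustration; the spec
itself only fixes the 24-hour wait and the roughly hourly probing.

```python
import time
from typing import Callable, Optional


def find_update_source(
    relative_url: str,
    school_server: Optional[str],
    upstream_base: str,
    has_update: Callable[[str], bool],
    sleep: Callable[[float], None] = time.sleep,
    max_wait_hours: int = 24,
    probe_interval_s: float = 3600.0,
) -> str:
    """Return the URL of the first source found to hold the update.

    has_update is a hypothetical probe (e.g. an rsync listing) that
    returns True when the given URL already serves the update.
    """
    if school_server is not None:
        local_url = f"rsync://{school_server}/{relative_url}"
        # One immediate probe, then roughly hourly probes for up to 24 hours.
        for attempt in range(max_wait_hours + 1):
            if has_update(local_url):
                return local_url
            if attempt < max_wait_hours:
                sleep(probe_interval_s)
    # No known school server, or it never obtained the update:
    # fall back to the upstream master server.
    return f"{upstream_base.rstrip('/')}/{relative_url}"
```

Passing the probe and the sleep function in as parameters keeps the
policy separate from the network code, and lets the wait logic be
exercised without waiting a day.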

Once an update source has been found, the laptop will invoke the
standard rsync tool over a plaintext (unsecured) connection via the
rsync protocol -- not piped through a shell of any kind -- to bring
its own files up to date with the more recent version of the
system. rsync uses a network-efficient binary diff algorithm which
satisfies goal 3.
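
A sketch of how the updater might assemble that rsync invocation
follows. The options used (--archive, --delete, --partial) are standard
rsync options, but their selection here is an assumption, as is the
helper name; the spec does not name any flags. The rsync:// check
reflects the requirement that the transfer uses the rsync protocol
directly rather than a remote shell.

```python
from typing import List


def build_rsync_command(source_url: str, target_dir: str) -> List[str]:
    """Build an argument list for syncing an OS image into target_dir."""
    if not source_url.startswith("rsync://"):
        # rsync:// URLs talk to an rsync daemon; no remote shell is involved.
        raise ValueError("expected an rsync:// URL")
    return [
        "rsync",
        "--archive",  # recurse and preserve permissions, times and symlinks
        "--delete",   # remove local files that are absent from the new image
        "--partial",  # keep partial files so an interrupted run can resume
        source_url.rstrip("/") + "/",  # trailing slash: sync the contents
        target_dir,
    ]
```

The resulting list can be handed to subprocess.run without a shell, so
no part of the URL is ever interpreted by one.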



1.3. Design note: peer-to-peer updates
--------------------------------------

It is desirable to provide "viral update" functionality at a later date,
such that two laptops with different software versions (and without any
notion of trust) can engage in an update to bring the laptop with the
older software fully up to date.

However, determining how to provide this functionality securely,
efficiently and elegantly is not feasible on the Gen1 FRS
timeline. Therefore, laptop-to-laptop updates will NOT be a part of the
updater that ships with the FRS image, and are a candidate for release
2-3 months after FRS.



1.4. Design note: rsync scalability
-----------------------------------

rsync is a known CPU hog on the server side. It would be absolutely
infeasible to support a very large number of users from a single rsync
server. This is far less of a problem in our scenario for three reasons:

     * High branching factor
       In all normal circumstances, the vast majority of the rsync
       traffic to our upstream servers will come from school servers,
       not individual laptops.  If school servers are unavailable or
       malfunctioning, it is not the case that there will be a flood of
       requests from individual laptops, because it's likely that the
       school servers are those laptops' only gateway to the Internet.

     * Element of randomness in anti-theft requests
       Instead of hitting the update servers every hour on the hour,
       the laptops are already including an element of randomness in
       choosing when to contact the anti-theft service. This random
       delay propagates to the rsync requests, as well.

     * In-depth stagger abilities on the server side
       Because notification of new updates is performed by the
       anti-theft service which is aware of a laptop's locale, updates
       can be staggered over several days by country, region, or any
       other metric such as server load.
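
The last two points can be sketched in a few lines of Python. Both
functions, their parameters and the jitter fraction are hypothetical;
the spec does not say how the random delay is drawn or how the
anti-theft service orders a staggered rollout.

```python
import random
from typing import List


def next_contact_delay(
    base_interval_s: float, jitter_fraction: float, rng: random.Random
) -> float:
    """Client side: base interval plus or minus a random spread."""
    spread = base_interval_s * jitter_fraction
    return base_interval_s + rng.uniform(-spread, spread)


def rollout_day(region: str, rollout_order: List[str], regions_per_day: int) -> int:
    """Server side: day offset (0-based) on which a region is first notified."""
    return rollout_order.index(region) // regions_per_day
```

With a daily base interval and a jitter fraction of 0.25, for example,
each laptop would contact the service somewhere between 18 and 30 hours
after its previous contact, which spreads the follow-on rsync requests
over the same window.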

Additionally, some optimizations can be added to rsync proper to aid
with our use case, but such engineering will need to wait until after
FRS.



1.5. Implementation
-------------------

In order to implement runtime file protection, Bitfrost relies on the
COW functionality of the Linux-VServer patchset. The functionality
imbues immutable hardlinks within a designated context with special
meaning: when broken by some destructive file operation, VServer will
replace these hardlinks with the content of the file they were pointing
to and apply the desired operation on the resulting copy.

The XO updater will run in a special context to which the security
service has exposed the entire underlying filesystem as a COW copy. The
updater will update this COW copy in-place with rsync. This COW
mechanism simply ensures no excess authority lies with the updater; any
failures or vulnerabilities in it do not propagate to the rest of the
system.

One file contained within each OS image will be its cryptographically
signed manifest; at the end of the rsync operation, the laptop will have
obtained that file. At this point, the updater will request that the
security service apply the update. Note that due to the nature of
rsync, we can stop and restart the network phase of a single update
several times as connectivity becomes available, and until we've
received the complete update.

The security service will terminate the updater and then analyze the
manifest and confirm the modified files in the updater's context exactly
match the expected OS image end-state. If any discrepancy is discovered,
the updater context will be discarded and the update operation aborted.
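
A minimal sketch of the end-state check, under stated assumptions: the
manifest is taken to be a mapping from relative path to SHA-256 hex
digest (the spec does not define the manifest format or hash), and the
signature on the manifest is assumed to have been verified already by
a step not shown here.

```python
import hashlib
import os
from typing import Dict


def hash_file(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_matches_manifest(root: str, manifest: Dict[str, str]) -> bool:
    """True only if the files under root are exactly those in the manifest."""
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            # An unlisted file or a content mismatch is a discrepancy.
            if manifest.get(rel) != hash_file(full):
                return False
            seen.add(rel)
    # Every file named in the manifest must also be present on disk.
    return seen == set(manifest)
```

A real check would also need to cover file modes, ownership, symlinks
and empty directories; the sketch compares regular file contents only.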

If the update is verified to be complete and correct, the security
service will mark it as such, and designate the files within it to be
the files exported into all newly-created containers. System service
containers will be restarted gracefully.  If the image manifest did
not contain a header identifying that image as a high-priority update,
the update process ends here. Restartable services have been restarted,
and the rest of the system will be initialized from the update on
reboot.

If the update has been marked as high-priority, the user will be asked
to close applications and reboot his machine immediately. A timer will
run that will reboot the machine in 60 minutes if the user does not do
so. The high-priority timer can be disabled in the security center; its
purpose is merely to provide some extra protection to the youngest users
who cannot necessarily be expected to understand or comply with the
reboot request.

On boot, the first initialization script to run will perform a
pivot_root operation to the directory that currently holds the OS image
marked bootable by the security service. With the example above, it
would be the directory that belonged to the updater's context. If a key
is depressed during boot, however, the pivot_root is performed to the
_old_ bootable context, and the user is presented with a dialog asking
whether she would like to make the rollback permanent.

The kernel is the only exception to this handling: in the event that
a verified update contains an updated kernel, that kernel will be placed
into a predetermined place in the underlying filesystem by the security
service.  OpenFirmware will preferentially boot this newer kernel unless
the rollback key combination is depressed during boot.

Notice that the update operation has been reduced to a simple state
toggle between (any) two OS images. In so doing, we have satisfied goals
1 and 2.
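
The state toggle can be modelled in a few lines. This is an
illustrative sketch only: the class, its field names and the image
identifiers are hypothetical, and the real state would be kept by the
security service rather than in a Python object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageState:
    current: str                    # image marked bootable
    previous: Optional[str] = None  # image kept for rollback, if any

    def apply_verified_update(self, new_image: str) -> None:
        """Mark a verified image bootable and keep the old one for rollback."""
        self.previous, self.current = self.current, new_image

    def boot_target(self, rollback_key_held: bool) -> str:
        """Pick the image to pivot_root into for this boot."""
        if rollback_key_held and self.previous is not None:
            return self.previous
        return self.current

    def make_rollback_permanent(self) -> None:
        """Swap the two images, so the older one becomes the default."""
        if self.previous is not None:
            self.current, self.previous = self.previous, self.current
```

Because a rollback is just another swap of the same two references, it
never depends on the contents of the update being well-formed.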




2. Application updater
======================

2.1. Design
-----------

The XO eschews traditional dependency-based approaches to package
management, making application upgrades somewhat difficult. The problem
is compounded by the fact that Bitfrost does not permit applications to
update themselves in-place, which is a common update method on platforms
such as Mac OS X and Windows.

When it comes to application updates, we wish to stay true to our goals
of security and low-bandwidth updates, but are willing to settle for
less fault tolerance as necessitated by the fact that most activities
won't be OLPC-written or maintained.

The design should make it possible to have a single tool that can
ascertain the existence of updated versions of any currently installed
activities, and then fetch and install those updates. It should do so
bandwidth-efficiently, such that files that are unchanged between
activity versions aren't downloaded as part of the update, and also such
that identical resource files packaged by multiple activities are never
downloaded more than once, or not at all if they already exist on the
system.



2.2. Implementation
-------------------

A manifest file is added to the bundle format specification. The
manifest consists of the filename and strong cryptographic hash of every
file in the bundle. Another file is added, called 'origin', that
specifies a URL where updated activity bundles may be found, and a
public key which will be used to sign such updated bundles.
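
For illustration, a manifest of this kind could be computed from a
bundle as sketched below. SHA-256, the dictionary layout and the
exclude parameter (for skipping the manifest file itself) are
assumptions; the bundle format specification would fix the actual hash
and on-disk syntax.

```python
import hashlib
import io
import zipfile
from typing import Dict, Iterable


def bundle_manifest(bundle_bytes: bytes, exclude: Iterable[str] = ()) -> Dict[str, str]:
    """Map each file name in a ZIP bundle to the SHA-256 of its contents."""
    skipped = set(exclude)
    manifest = {}
    with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as bundle:
        for info in bundle.infolist():
            if info.is_dir() or info.filename in skipped:
                continue
            # Hash the uncompressed contents, so the digest does not
            # depend on how the file was compressed in the bundle.
            manifest[info.filename] = hashlib.sha256(bundle.read(info)).hexdigest()
    return manifest
```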

When a global activity update is initiated, the updater enumerates the
origins for all installed activities, then probes each one in turn to
determine which activities have available updates. The resulting
activity list is the 'available update set'.

The most up-to-date bundle for each activity in the set is accessed, and
the last several kilobytes downloaded. Since bundles are simple ZIP
files, whose index (the central directory) is stored at the end of the
archive, the downloaded data will contain the index, which stores byte
offsets for the constituent compressed files. The updater then locates
the bundle manifest in each index and makes an HTTP request with the
respective byte range to each bundle origin. At the end of this process,
the updater has cheaply obtained a set of manifests of the files in all
available activity updates.
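
A sketch of the index lookup follows. In the ZIP format the index
(central directory) is located through a 22-byte "end of central
directory" record at the very end of the archive, so the sketch works
on the trailing bytes of a bundle. The function names are hypothetical,
and ZIP64 archives and archive comments containing the record signature
are not handled.

```python
import struct
from typing import Dict, Tuple

EOCD_SIGNATURE = b"PK\x05\x06"
# signature, disk no., CD start disk, entries on disk, total entries,
# central directory size, central directory offset, comment length
EOCD_FORMAT = "<4sHHHHIIH"
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)  # 22 bytes


def central_directory_range(tail: bytes) -> Tuple[int, int]:
    """Return the inclusive (first, last) byte offsets of the ZIP index.

    tail is the last few kilobytes of the archive.
    """
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_SIZE > len(tail):
        raise ValueError("end of central directory record not found")
    fields = struct.unpack(EOCD_FORMAT, tail[pos:pos + EOCD_SIZE])
    cd_size, cd_offset = fields[5], fields[6]
    return cd_offset, cd_offset + cd_size - 1


def range_header(first: int, last: int) -> Dict[str, str]:
    """Build an HTTP Range header for an inclusive byte range."""
    return {"Range": f"bytes={first}-{last}"}
```

The same range_header helper would serve for the follow-up request that
fetches only the manifest, once its offset and compressed size have
been read from the index.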

A local database of manifests of all installed activities is kept,
pruned only to records for files larger than a set size, e.g. 50
KB. The updater cross-references each manifest from the available
update set with the installed database, and then with other manifests
in the set. Files which exist locally and are also present in the
available update set aren't downloaded; the updater simply "plants"
the files in the right places. The same happens for identical files
present in multiple bundles in the available update set; they are only
downloaded once.
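
The cross-referencing step might look like the sketch below. It assumes
each manifest is a mapping from file name to content hash and that the
installed database is reduced to a set of hashes (which, per the
pruning rule above, would cover only the larger files); the function
and variable names are made up for illustration.

```python
from typing import Dict, List, Set, Tuple


def plan_downloads(
    installed_hashes: Set[str],
    update_manifests: Dict[str, Dict[str, str]],
) -> Tuple[Dict[str, Tuple[str, str]], List[Tuple[str, str, str]]]:
    """Split the files of all pending updates into downloads and plants.

    Returns (to_download, to_plant): to_download maps each needed hash
    to the (activity, filename) it will be fetched from; to_plant lists
    (activity, filename, hash) entries that need no download of their own.
    """
    to_download: Dict[str, Tuple[str, str]] = {}
    to_plant: List[Tuple[str, str, str]] = []
    for activity in sorted(update_manifests):
        for filename, digest in sorted(update_manifests[activity].items()):
            if digest in installed_hashes or digest in to_download:
                # Already on the system, or already scheduled from
                # another bundle: plant a copy instead of fetching it.
                to_plant.append((activity, filename, digest))
            else:
                to_download[digest] = (activity, filename)
    return to_download, to_plant
```

Keying the download set by content hash rather than by file name is
what makes identical files shared between bundles collapse into a
single transfer.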

After a bundle (minus any redundant files) has been downloaded, it is
unpacked and reassembled (if it needs any of the files that haven't been
downloaded because they already exist). Cryptographic signature
verification is performed. If remaining disk space is larger than a
particular margin, e.g. 20%, then the context containing the older
version of the activity bundle is kept around, and the user given the
ability to perform rollback on the activity update. Otherwise, the old
version bundle is destroyed.
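
The retention rule reduces to a single comparison, sketched here with a
hypothetical helper; the 20% default simply mirrors the example margin
given above.

```python
def keep_old_version(free_bytes: int, total_bytes: int, margin: float = 0.20) -> bool:
    """True if enough disk space remains to keep the old bundle for rollback."""
    return total_bytes > 0 and free_bytes / total_bytes > margin
```

The two byte counts could come from the free and total fields returned
by Python's shutil.disk_usage for the filesystem holding the activity
bundles.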





:Author
     Ivan Krstić
     ivan AT laptop.org
     One Laptop per Child
     http://laptop.org

:Metadata
     Revision: Draft-14
     Timestamp: Tue Jun  26 17:51:45 UTC 2007


END



--
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D



