[sugar] DS possibilities
Benjamin M. Schwartz
bmschwar at fas.harvard.edu
Wed Apr 23 15:42:17 EDT 2008
Reports from the field, especially from Carla and Bryan, have indicated
that the datastore can get into a corrupted state from which it cannot
recover. The corruption persists over a reboot. After corruption,
subsequent datastore calls usually raise exceptions, which are not handled
by the Activities (including the Journal) and so no Activities will load,
and Sugar is unusable.
Fixing this behavior before the next release is clearly a very high
priority. We have considered many possible strategies, and we must now
determine which of them we will pursue. There is no need to pursue only
one; most are not exclusive. Personally, I expect that more than one will
be undertaken.
Actions:
Make Activities more robust to DS failures.
By handling DS exceptions correctly, Activities might fall back into a
no-datastore mode, or a read-only mode. This may require modification of
sugar.activity.activity, as well as many Activities' codebases.
Pro:
Reduces the severity of DS failure and corruption
Improves user experience immediately
Con:
A tremendous amount of work to modify every piece of code that uses the
datastore's API.
If the DS is not working, then nothing can be saved or nothing can be
read, and the system is of very limited utility either way. Thus, this
resilience provides only a very small benefit.
The DS is as critical as the filesystem, memory manager, or process
scheduler. It should absolutely never fail, and creating contingencies
for its failure sets entirely the wrong perspective.
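As a rough illustration of the fallback idea above, here is a minimal
sketch in Python. It is not the sugar.activity.activity code; the names
SafeDatastore, DatastoreError and BrokenWrites are hypothetical, and the
backend is a toy object standing in for the real datastore. It shows a
wrapper that drops to a read-only mode when writes fail and to a
no-datastore mode when reads fail, instead of letting the exception
kill the Activity.

```python
class DatastoreError(Exception):
    """Hypothetical stand-in for whatever a broken datastore raises."""


class SafeDatastore:
    """Wraps a datastore-like object and degrades instead of raising."""

    def __init__(self, backend):
        self._backend = backend
        # One of "read-write", "read-only", "no-datastore".
        self.mode = "read-write"

    def find(self, query):
        if self.mode == "no-datastore":
            return []
        try:
            return self._backend.find(query)
        except Exception:
            # Even reads fail: stop using the datastore altogether.
            self.mode = "no-datastore"
            return []

    def write(self, entry):
        if self.mode != "read-write":
            return False
        try:
            self._backend.write(entry)
            return True
        except Exception:
            # Writes fail but reads may still work: go read-only.
            self.mode = "read-only"
            return False


class BrokenWrites:
    """Toy backend whose writes always fail (assumption, demo only)."""

    def find(self, query):
        return ["entry-1"]

    def write(self, entry):
        raise DatastoreError("index corrupted")


ds = SafeDatastore(BrokenWrites())
saved = ds.write({"title": "drawing"})  # fails, wrapper goes read-only
found = ds.find({})                     # reads still pass through
```

Catching every Exception is deliberate in this sketch, since the post
does not say which exceptions the datastore raises; real code would
want to narrow that.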
Improve logging.
By logging the details of what the DS, and the rest of the system, is
doing, we might be able to determine what has gone wrong. For example, we
might notice that DS corruption occurs due to OOM killing the DS process,
or due to the system shutting down unexpectedly after loss of power, or
due to removing a USB stick without unmounting it.
Pro:
Lets us know what's going on so we have a chance of fixing it.
Improves possibility of reproducing it here.
Con:
Doesn't solve the problem.
Potentially slows down the system.
We already have a lot of logging.
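One cheap way to get this kind of detail is to wrap each datastore entry
point so that the call, its arguments, and any failure are logged. The
sketch below uses only the standard Python logging module; the logger
name "datastore.trace" and the function write_entry are hypothetical
and do not correspond to the existing datastore code.

```python
import functools
import logging

log = logging.getLogger("datastore.trace")  # hypothetical logger name


def traced(func):
    """Log entry, exit, and any exception of the wrapped function."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Records the traceback, then lets the error propagate.
            log.exception("failed %s", func.__name__)
            raise
        log.debug("done %s", func.__name__)
        return result

    return wrapper


@traced
def write_entry(uid, data):
    """Hypothetical datastore operation used only for the demo."""
    if not uid:
        raise ValueError("empty uid")
    return len(data)
```

A "call" line with no matching "done" or "failed" line in the log would
then suggest that the process died in the middle of that operation.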
Notify the user when things go wrong.
If the DS is broken, we may attempt to improve the user experience by
presenting the user with a "blue screen" or other notification indicating
what has happened. This might also improve debugging, since users would
be more likely to send bug reports (and logs) back for analysis.
Pro:
Improved user experience beyond silent failure.
Suggests that the developers are at least smart enough to recognize how
the system can fail.
Improves the probability that we will get back logs after failure.
Con:
The DS is as critical as the filesystem, memory manager, or process
scheduler. It should absolutely never fail, and creating contingencies
for its failure sets entirely the wrong perspective.
Set up a datastore test system.
These datastore bugs still exist largely because they have not been
reproduced in any systematic way. If we find, perhaps by forcing the
datastore to crash on command, that we can reproduce the corruption, then
we may begin to fix it.
Pro:
Improves the ability to fix the system
Improves the ability to determine that the system is reliable.
Con:
?
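A minimal sketch of what "crash on command" could look like, assuming
for the sake of the example that an entry's metadata is a JSON file
that gets overwritten in place. That storage format is an assumption
made here, not a description of the current datastore, and every name
below is hypothetical. The test harness interrupts a write after a
chosen number of bytes and then checks whether the file is still
readable.

```python
import json
import os
import tempfile


class SimulatedCrash(Exception):
    """Raised to imitate the process dying part-way through a write."""


def naive_save(path, metadata, crash_after=None):
    """Overwrite path in place; optionally stop after crash_after bytes."""
    payload = json.dumps(metadata).encode("utf-8")
    with open(path, "wb") as f:
        if crash_after is not None:
            f.write(payload[:crash_after])
            f.flush()
            raise SimulatedCrash()
        f.write(payload)


def is_corrupt(path):
    """Return True if the file at path can no longer be parsed."""
    try:
        with open(path, "rb") as f:
            json.loads(f.read().decode("utf-8"))
        return False
    except ValueError:
        return True


def run_trial(crash_after):
    """Save once cleanly, crash during a second save, report corruption."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "entry.metadata")
    naive_save(path, {"title": "first version"})
    try:
        naive_save(path, {"title": "second version"},
                   crash_after=crash_after)
    except SimulatedCrash:
        pass
    return is_corrupt(path)
```

Running run_trial over a range of crash_after values gives a
repeatable way to see which interruption points leave a broken file
behind.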
Fix the current datastore implementation.
By improving the ability of the datastore to avoid corruption, and
increasing its resilience to corrupted files, we may solve this problem.
The solution may be complete, or it may be incomplete, depending on
whether there are issues of race conditions or atomicity that cannot be
fixed in the current implementation.
Pro:
Least likely to introduce new bugs
Possibility of fixing the existing bugs
Con:
The current datastore doesn't support versioning, and will eventually be
thrown away entirely. Thus, any further work on it will be discarded too.
The system relies on Xapian, which is likely responsible in part for the
corruption, and is much more difficult to fix since it is taken directly
from upstream.
The problems may require drastic alterations to the current design.
The current datastore design is deeply incompatible with backups due to
the potential for simultaneous access to a single file.
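On the atomicity point, one common technique (not necessarily
applicable to every file the datastore or Xapian touches) is to write
to a temporary file in the same directory and then rename it over the
target, so that a reader sees either the old contents or the new ones
and never a half-written file. A minimal sketch, with the hypothetical
function name atomic_write:

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at path with data, never leaving a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target
    # so that the final rename is a single atomic step.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

This only protects individual files. It does nothing for consistency
between several files, or between a file and the index.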
Write a new datastore implementation to the same API.
Given the complexity of the current datastore, and its dependence on
Xapian, it may be very difficult to remove the corruption bugs. Instead,
we may consider writing a new datastore that uses a much simpler backend
to implement the same API. See
http://dev.laptop.org/git?p=users/mstone/sds;a=summary .
Pro:
Drastic simplification is possible, and could greatly reduce the chance
for bugs to appear. Simplicity is extremely valuable.
A simpler datastore design might also be more amenable to easy backup
solutions.
Con:
This code would have to be written and then thrown away, since it does not
help us to reach our goal of versioning.
The drastic simplification would presumably come at the cost of indexing
performance; searches might become much slower.
Many feel that new code inevitably implies more bugs.
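To make the tradeoff concrete, here is a sketch of about the simplest
backend one could imagine: one JSON file per entry, and a search that
just scans every file. This is not the design in the sds repository
linked above, and the class and method names (SimpleStore, create, get,
find) are hypothetical rather than the real datastore API.

```python
import json
import os
import tempfile
import uuid


class SimpleStore:
    """One JSON metadata file per entry; no index at all."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, uid):
        return os.path.join(self.root, uid + ".json")

    def create(self, metadata):
        uid = uuid.uuid4().hex
        tmp = self._path(uid) + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(metadata, f)
        os.replace(tmp, self._path(uid))  # entry appears all at once
        return uid

    def get(self, uid):
        with open(self._path(uid), encoding="utf-8") as f:
            return json.load(f)

    def find(self, **query):
        """Linear scan: cost grows with the number of entries."""
        matches = []
        for name in sorted(os.listdir(self.root)):
            if not name.endswith(".json"):
                continue
            uid = name[:-len(".json")]
            try:
                metadata = self.get(uid)
            except (OSError, ValueError):
                continue  # one unreadable entry does not break the rest
            if all(metadata.get(k) == v for k, v in query.items()):
                matches.append(uid)
        return matches


store = SimpleStore(tempfile.mkdtemp())
paint_uid = store.create({"title": "drawing", "activity": "Paint"})
paint_entries = store.find(activity="Paint")
```

A damaged entry costs only that entry here, but every find has to read
every file, which is the indexing cost mentioned above.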
Write a new datastore implementation with a new API.
Given that we wish not only to remove the bugs in the datastore but also
to redesign its implementation and feature set entirely, we may choose to
write a new datastore that exposes a different API.
See http://dev.laptop.org/git?p=users/cscott/olpcfs1;a=summary ,
http://wiki.laptop.org/go/Olpcfs .
Pro:
Gets us where we really want to go.
Con:
A hard problem that will not be solved quickly.
More complex to debug due to its new, experimental features.
If the current API is not supported, then much of the system and
activities would have to be rewritten.