Need Help

Benjamin M. Schwartz bmschwar at fas.harvard.edu
Tue Mar 4 22:54:18 EST 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Michael Stone wrote:
| On Tue, Mar 04, 2008 at 08:22:31PM -0500, Benjamin M. Schwartz wrote:
|> Michael Stone wrote:
|> | My central error-handling goal has been to compactly express my
|> | assumptions in a form that will prevent them from being violated in
|> | ignorance. Should I have different goals?
|>
|> 1. I find Rainbow very impressive, and I am sure you are well aware of the
|> various arguments made regarding error handling.
|
| Thank you. While it's true that I'm aware of some arguments regarding error
| handling, I'm always interested in improving. It seems like one of the
| most regularly failed challenges in the craft of programming.
|
|> In my view, restricting assertions to internal invariants provides an
|> easy way of distinguishing problems in Rainbow from problems in
|> Activities and other parts of the system.
|
| True, but the convention that I have established of separating error
| messages into contract-violations and 'everything else', recorded in
| per-activity logs and in a daemon-wide log (/var/log/rainbow) would seem
| to accomplish similar goals.

I have not read the relevant Rainbow source, so I cannot comment very
intelligently on this.  However, if Rainbow wishes to log a contract
violation, it should insert the phrase "contract violation" into the
logfile.  Otherwise, how is a person reading the log to know this?
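
As a rough illustration of that suggestion, a small helper could put the
literal phrase at the front of every such log entry.  I have not taken this
from the Rainbow source: the logger name and the log_contract_violation
helper are hypothetical, and the example detail string is only illustrative.

```python
import io
import logging


def make_logger(stream):
    """Build a logger that writes bare message lines to the given stream."""
    logger = logging.getLogger("rainbow-sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers = [logging.StreamHandler(stream)]
    logger.propagate = False
    return logger


def log_contract_violation(logger, check, detail):
    """Record a failed check so a reader can search for the exact phrase."""
    logger.error("contract violation: %s: %s", check, detail)


if __name__ == "__main__":
    buf = io.StringIO()
    log_contract_violation(
        make_logger(buf), "check_cwd", "bundle directory permissions"
    )
    print(buf.getvalue(), end="")
```

A person reading the log can then search for "contract violation" without
knowing anything about the code that produced the line.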

|> 2. Among your goals, you might consider maximizing the ability of novice
|> programmers to figure out what they've done wrong.
|
| It's not my primary goal, but I'll agree that it's worth considering.
|
|> The wiki page on translation even goes so far as to
|> recommend using gettext for error strings, so that users and
|> administrators may debug the system without knowing English.

I used the phrase "debug the system".  That was a poor choice.  I should
say "recognize bugs in the system", and additionally "distinguish between
bugs in the system and bugs in the activities they're developing".

|
| I'm still not convinced. Wouldn't we be better served by translating the
| source code itself, or an overview of the source code like my 'Taste the
| Rainbow' pages?
|
| Consider: in my experience, debugging consists of searching the diff
| between one's mental model and reality from which it follows that the
| material which should be translated is the material which provides the
| clearest, most accurate mental model of the problem.

Your experience is extremely unusual and non-representative.  You are an
expert computer scientist who frequently reads source code written by
others.  You are familiar with the OLPC operating system details,
including D-Bus and the Bitfrost requirements, perhaps more so than anyone
else in the world.

The people who will be reading these logfiles will be developers who are
trying to debug their activities.  The activity may have crashed because
it attempted to violate a Bitfrost rule and was killed by Rainbow.  These
developers (ideally mostly children) will likely be building their
activities by making small modifications to existing activities.  That
means most won't even understand their own code.  How could you possibly
expect them to understand yours?

| Also consider: had there been an actual bug in Rainbow, which would have
| been more useful to Waqas in diagnosing and fixing the problem:
| translated error messages or better written or documented source code?

Not fixing.  It is absurd to imagine that any appreciable number of users
will be able to fix Rainbow bugs.  Rather, when Rainbow experiences an
internal error, it should be extremely obvious that the problem is with
Rainbow.  For example, an excellent type of behavior would be for Rainbow
to print, in the logfile:

RAINBOW BUG: Rainbow has encountered an internal error.  This indicates a
bug in Rainbow.  The error code is 752.

This line would be sufficient for activity developers to understand that
the problem is not simply in their code. It also makes it possible for
users to participate usefully in the development process, by reporting the
bug in an unambiguous way.  Error codes are also important because they
allow users to identify problems even when e-mailing logfiles is
impossible due to software bugs or lack of connectivity.  This error line
is also nice because it only needs to be translated once, with the error
code number substituted programmatically.
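
A minimal sketch of the "translate once, substitute the code" idea, using
Python's standard gettext module.  The "rainbow" translation domain and the
format_internal_error function are assumptions made for illustration; with
no catalog installed, the English text is returned unchanged.

```python
import gettext

# Hypothetical domain name; fallback=True yields the untranslated string
# when no message catalog can be found.
_ = gettext.translation("rainbow", fallback=True).gettext


def format_internal_error(code):
    """Return the single translatable bug line with the code filled in."""
    template = _(
        "RAINBOW BUG: Rainbow has encountered an internal error.  "
        "This indicates a bug in Rainbow.  The error code is %(code)d."
    )
    return template % {"code": code}


if __name__ == "__main__":
    print(format_internal_error(752))
```

Because the number is substituted after translation, translators see only
one string no matter how many error codes exist.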

This output could be improved further by adding an additional sentence,
such as:

This error code indicates that Rainbow's directory permissions have
reached an inconsistent state.

This line, like a BSOD, serves mainly to make users feel like the system's
designers want them to know what's going on in case of a failure.
However, the implementation overhead is undeniably high, especially given
the need for many translations.  On the plus side, these strings also
serve as documentation when reading the source code.
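
One way the optional sentence might be kept, sketched as a lookup table.
Only code 752 and its meaning come from the example above; the table name
and the describe_error function are hypothetical.

```python
# Hypothetical table of per-code explanations.  Each entry would need its
# own translation, which is the overhead mentioned above.
ERROR_DESCRIPTIONS = {
    752: (
        "This error code indicates that Rainbow's directory permissions "
        "have reached an inconsistent state."
    ),
}


def describe_error(code):
    """Return the extra sentence for a code, or '' if none is recorded."""
    return ERROR_DESCRIPTIONS.get(code, "")


if __name__ == "__main__":
    print(describe_error(752))
```

Codes without an entry simply produce no extra sentence, so the table can
be filled in gradually.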

|
| Put another way, doesn't this kind of error message uselessly duplicate
| information that is best recorded in the failing assertion itself (and
| in the name of the function containing it, in this case,
|
|   check_cwd(... [cwd=]/home/olpc/Activities/Qirat.activity)
|       assert ck.negative(W_OK, 0)
|
| ?

I have no idea what any of those names mean, despite having looked at the
source.  I can now guess that "cwd" means "current working directory" and
"ck" means "check", but I still have no idea what the code actually does.
Reading code is hard, and you should never expect anyone to do it unless
they are planning on modifying that code.

|
|> 3.  Did this assertion failure result in the termination of the Rainbow
|> daemon?
|
| The present implementation calls clone() before executing any
| activity-launching code. Termination of the child by failure to handle
| the AssertionError is a design goal.
|
|> Raising exceptions for input errors has the distinct
|> advantage of allowing one to catch exceptions thrown further down the call
|> stack, instead of exiting.  Note that when I say "specific exceptions", it
|> would be perfectly reasonable to wrap up all errors due to permissions in
|> a PermissionsException, etc.
|
| First, what can I reasonably expect to accomplish by catching such an
| exception?

You can print a sensible error message, such as "The current activity
(Qirat) could not be launched because the permissions on its bundle
directory are insecure."

| Second, given that the exception is being raised in a child
| process that may have been compromised by malicious data, I'm not
| terribly interested in informing the main daemon to the particulars of
| the failure; the log file is quite sufficient for my purposes.

I agree; there is no need to send information up to the main daemon.  I
think specialized exceptions make it easier to achieve informative logfiles.
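
To make this concrete, here is a sketch of a specialized exception being
caught in the child process and turned into a readable log line.  All of
the names below (RainbowInputError, check_bundle_permissions, launch, and
the 'insecure' flag that stands in for a real permission test) are
hypothetical; only PermissionsException and the message text come from
this thread.

```python
class RainbowInputError(Exception):
    """Hypothetical base class for errors caused by bad input."""


class PermissionsException(RainbowInputError):
    """Raised when an activity's bundle directory is not safe to use."""

    def __init__(self, activity, path):
        super().__init__(activity, path)
        self.activity = activity
        self.path = path


def check_bundle_permissions(activity, path, insecure):
    # 'insecure' is a placeholder for an actual permission check.
    if insecure:
        raise PermissionsException(activity, path)


def launch(activity, path, insecure, log):
    """Return True on success; log a readable line and return False if not."""
    try:
        check_bundle_permissions(activity, path, insecure)
    except PermissionsException as exc:
        log(
            "The current activity (%s) could not be launched because the "
            "permissions on its bundle directory are insecure." % exc.activity
        )
        return False
    return True


if __name__ == "__main__":
    launch("Qirat", "/home/olpc/Activities/Qirat.activity", True, print)
```

Nothing is sent to the main daemon here; the child writes the message to
its own log and can then exit as before.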

- --Ben
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHzhlqUJT6e6HFtqQRAk4pAJ9xab+6sXvc6RSOqBLFkalBo4UFtgCff5B6
HJg89MaTolZ9rPryVhyzOAU=
=DGoW
-----END PGP SIGNATURE-----


