[OLPC Networking] Re: NTP core dump

Jim Gettys jg at laptop.org
Tue Apr 18 15:46:04 EDT 2006


Boy, you really have become a serious time geek, Hal...

On Tue, 2006-04-18 at 01:33 -0700, Hal Murray wrote:
> Most of the NTP work comes from Dave Mills, udel.edu.  He does the core 
> science himself, but there is a cloud of volunteers that handle all the 
> details, including making it work on Windows.  I think some of them were 
> Dave's grad students.  The distribution is currently hosted at 
> www.ntp.isc.org.

I think Linux uses the Mills code.  There seems to be a Version 4.2 in
the code; that makes it pretty up to date, I believe.

> It's included in most of the *nix distributions.  Some have fairly old 
> versions.  Some have done some reasonable patches to make it run as non-root. 
>  I haven't seen a clean portable version of that corner.
> 
> There are a couple of alternate implementations of the basic protocol.
> 
> One is from Microsoft.  It's based on SNTP, a stripped-down version of the 
> protocol that uses the same packet formats.  As usual, Microsoft even managed 
> to botch that.  I don't remember details.  I think they are cleaning it up, 
> at least the horrible botches.

Why am I not surprised?  Sounds like other implementations they've done
of other protocols.

> 
> Another is OpenNTP.  I think it comes from some people associated with 
> OpenBSD.  I'm not sure why they decided to reinvent the wheel.  They "fixed" 
> a few things with the expected results.

They *like* reinvention, usually in the name of security.  They "fixed"
ssh to try to use the broken X security extension, and then had to add
yet another option to turn it off (since many X applications lack
support for the extension).

So this is just par for the OpenBSD course.  If they talked to people
first, they might get less egg on their faces, but that isn't their
style.

> 
> There are also various roll-your-own versions in things like routers, 
> including a couple of horrible fuckups.  (more below)  I think most Cisco 
> routers include an implementation.
> 
> There is a sntp program included with the main/Mills ntp package.  It doesn't 
> do the drift corrections.  It's basically a replacement for ntpdate which has 
> become a pain to maintain.  ntpdate just smashes the clock to the right 
> value.  It was used for debugging and for setting the clock at boot time 
> before starting ntpd and sometimes from cron jobs if ntpd itself didn't do 
> what you wanted.  ntpd now includes a command line switch to do the 
> initialization that ntpdate was used for, so that reason for supporting 
> ntpdate is now history.  (I'll say more if you want.)
> 
> ----------
> 
> Current NTP technology requires a manual step on setup/installation to 
> specify the servers.
> 
> Are you familiar with the UWisc mess?  Netgear hardwired the IP address of a 
> UW-Madison NTP server into their routers, then shipped millions of them.  
> Here is a very good writeup:
>   http://www.cs.wisc.edu/~plonka/netgear-sntp/
> It's long (10+ pages), but very well done, and some of it can be skimmed.  
> I'd call it required reading for anybody interested in networking.

Yes, I've seen the full fiasco, and D-Link seems to have done a similar
screw-up; I've sent flames at D-Link on general principles.

Chris, could you please try to make sure we don't fuck up this way.

> 
> Aside from overloading the server with sheer numbers, Netgear had a bug in 
> their code that retransmitted once-per-second if it didn't get a response.  
> That turned into a disaster, positive feedback, shoot yourself in the foot 
> and take out UWisc while you are at it.
> 
> Please make sure you don't do anything dumb like that.  There are a couple of 
> similar events, one in progress.  Do you see RISKS?  If not, I'll send you a 
> copy when I write it up.

I don't watch RISKS; I probably should.

> 
> 
> I think the right way for most users to configure NTP is to specify the 
> servers via DHCP.  The DHCP RFCs have allocated a slot for NTP, but I don't 
> think anybody does this yet.  Or at least not much.  Maybe some distribution 
> rewrites /etc/ntp.conf, the normal parameter/configuration file for ntpd.

No, I suspect that mDNS is a better way these days.

DHCP doesn't even exist in the conventional sense in IPv6 (though there
are some things like it).


> 
> There is a chicken/egg mess in here.  Most ISPs don't run NTP servers for 
> their customers so they don't have any IP Addresses to plug into their DHCP 
> answers (or web pages) so that path isn't well supported.

We're talking to Vint at Google about IPv6 deployment; it might be good
to use a set of servers they could provide worldwide, for the top
stratum.

> 
> 
> Another worm is how many servers to use.  ntpd basically expects you to 
> configure it with several servers.  Then it will pick the best one and use 
> the others as a sanity check.  One server works as expected - the downstream 
> servers follow it for good or bad.  4 is the next interesting number.  That 
> leaves 3 if one is down.  With three, you can outvote a single bad guy.  
> (Mills often uses colorful language.  He calls the bad guys falsetickers.  
> The good guys are truechimers.)  Think Leslie Lamport and Byzantine Generals.

Ah, yes, the generals problem.
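For concreteness, the outvoting idea can be sketched like this (a toy majority vote over measured offsets, not ntpd's actual intersection/cluster algorithm):

```python
# Toy illustration of outvoting a falseticker: take offsets measured
# against several servers, find the largest group that agrees within a
# tolerance, and use its median.  (ntpd's real selection is Mills'
# intersection/cluster algorithm; this just shows the 3-vs-1 voting idea.)

def select_truechimers(offsets, tolerance=0.05):
    best = []
    for candidate in offsets:
        group = [o for o in offsets if abs(o - candidate) <= tolerance]
        if len(group) > len(best):
            best = group
    return sorted(best)

def consensus_offset(offsets, tolerance=0.05):
    group = select_truechimers(offsets, tolerance)
    return group[len(group) // 2]   # median of the agreeing majority

# Three truechimers near +10 ms and one falseticker 5 s off:
print(consensus_offset([0.012, 0.009, 5.0, 0.011]))   # 0.011
```

With four configured servers and one dead, three survivors can still outvote a single falseticker, which is why four is the interesting number.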

> 
> Another frill/feature is authentication.  I haven't paid much attention to 
> it.  The basic idea is for the server to cryptographically sign the packets 
> so clients can be sure they aren't getting spoofed time.
> 
> ---------
> 
> ntp has two ways to adjust the time.  Magic words are slew and step.
> 
> slew is good. Step is bad, especially if it goes backwards.
> 
> slew means fudge the time per tick to be slightly more/less than normal.  
> Most code doesn't notice anything.  The typical fudge factor is 500 ppm.
> This only works for small changes.  Big changes take too long.

Yeah, and complicating all this will be our propensity to drop into S3
state at the drop of a hat...  Can you say challenge?

> 
> The downside of slew is that it takes a long, long time to correct the clock 
> if you are way off.  ntpd defaults to stepping if the change is more than 128 
> ms.
> 
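The 500 ppm and 128 ms numbers above work out like this (a back-of-the-envelope sketch using ntpd's defaults):

```python
# Back-of-the-envelope: how long a 500 ppm slew takes to remove a given
# offset, and whether ntpd would step instead (default threshold 128 ms).

MAX_SLEW = 500e-6        # maximum slew rate, 500 ppm
STEP_THRESHOLD = 0.128   # seconds; above this, ntpd steps by default

def correction(offset_s):
    if abs(offset_s) > STEP_THRESHOLD:
        return "step"
    return f"slew for {abs(offset_s) / MAX_SLEW:.0f} s"

print(correction(0.100))   # 100 ms is slewed out in 200 seconds
print(correction(10.0))    # 10 s would take ~5.5 hours at 500 ppm, so: step
```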

Mark, what's the accuracy of the TOY crystal in our machine?
 
> ntpd has a sanity check (or several).  If it thinks it should step more than 
> 1000 seconds, it logs an error message and exits, assuming a human will fix 
> things.

Yeah, the usual cause is a dead TOY battery.  Mark, how long will ours
likely last in the field?

> 
> There is usually a hack to allow a single big step during startup.  The old 
> approach was to run ntpdate (possibly with the same set of servers that ntp 
> uses) before starting ntpd.  Recent ntpd has the -g switch that allows one 
> big step on startup.
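Putting the pieces above together, a minimal /etc/ntp.conf would look something like this (server names illustrative; start ntpd with -g to allow the one big initial step):

```
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
server 3.pool.ntp.org iburst
driftfile /var/lib/ntp/drift
```

Four servers so a single falseticker can be outvoted; the driftfile keeps the frequency correction across reboots.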
> 
> ----------
> 
> ntp/ntpd is normally both a client and a server.  The general idea is that 
> the root of the tree is some outside source of time:  GPS, WWVB, direct wire 
> to atomic clocks...  Watch and/or phone call, sundial.  Some are better 
> quality
> than others.  Those clocks are called stratum 0.  Yes, there are multiple 
> "roots" to this forest.
> 
> Stratum <n> servers get their time from stratum <n-1> servers.  They are 
> clients when talking to the lower stratum servers and servers when a 
> higher-stratum client asks them for the time.
> 
> The normal NTP package supports various gizmos you can connect to a PC to 
> make a stratum 1 server.  (Stratum 0 is the gizmo itself.)  Refclock is the 
> buzzword.  It's a set of compile-time options.

Interesting question whether we should ask the countries to provide
stratum 1 servers, and how many?  Or whether we can get Google to do
this for us, if we ask.

> 
> The usual trick is for the gizmo to connect via a serial port.  Often they 
> flap a wire once per second - PPS, Pulse Per Second.  You connect that up to 
> one of the modem control signals and patch the kernel to record the time when 
> that signal interrupts.  There is an RFC with the details.  Linux support is 
> poor.  It requires a patch.  There are two different patches/people for 2.4 
> vs 2.6 kernels.  Support on FreeBSD and NetBSD is good to great.
> 

Hmmm.  Do you know why the Linux support sucks?  And where can we find
the current patch?

> There is another tangle in here.  The kernel code can also process the PPS 
> data
> and directly modify the clock parameters.  I don't understand why that would 
> work any better than going out to a user application to do the processing and 
> then back into the kernel with the digested parameters.  Mumble.  It only 
> matters for serious geeks.

We're all serious geeks here.

> 
> For running your own stratum 1 server, the best deal I know of today is the 
> Garmin GPS-18 LVC.  You can get one for under $100.  Some assembly required.  
> The typical hack is to power it from USB.  (Beware, there are two other 
> versions of the GPS-18.  They don't support the PPS signal that you need for 
> good timekeeping.)

Interesting; might be cheap enough to put one in each school or so.

> 
> ----------
> 
> Do you know about "Allan deviation", or "Allan intercept"? It's a measure of 
> goodness of a clock.  If you measure the time/frequency at N second intervals 
> and plot the "error", you get two lines.  At the short times, the clock 
> may be stable but the error is 1 clock tick per measurement so the observed 
> error gets better as you average that over a longer time.  At long times you 
> get the (assumed) RMS error - longer averaging time gives more time for the 
> clock to drift.  If you plot error vs averaging time, you get a big V.  I 
> think the slope on the left (short times) is -1 and the slope on the right is 
> +1/2 (sqrt for RMS).
> 
> Recent/current versions of ntpd adjust their polling interval to track the 
> Allan deviation.  It ranges from 64 seconds to 1024 seconds (17 minutes).  
> That matches the parameters of most systems.  The polling interval typically 
> ramps up to 1024 on stable systems and steps back when the temperature 
> changes.
> 
> There is some special case (maybe just manual configuration) that extends 
> that out to a day or two.  It makes sense if you are dialing up to get your 
> time.  (The dialing stuff has a lot of jitter so the crossover point gets 
> pushed out to a much longer time.)

We can expect almost any network you can imagine for backhaul.

> 
> 
> Another important idea in ntp is that it assumes that the delays on the 
> network are symmetric.  In reality, the delay is often in one direction or 
> the other.  If you have a good clock and are using a good server, and you 
> plot RTT vs observed offset (assuming symmetry), you get a wedge pointing 
> left.  Good example here:
>     http://www.ijs.si/time/#net-jitter
> The point (left) of the wedge is the no-delay case.  The edges going up-right 
> and down-right from the point are delay in one direction or the other.  Stuff 
> in the middle is delay on both directions.
> 
> If you have a long path (many hops), you can notice the shift as the routing 
> changes.  If you have good clocks at both ends, you can see the asymmetry in 
> routing showing up as clock error.
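The symmetry assumption comes straight out of the standard four-timestamp calculation, which looks like this (a sketch; t1..t4 follow the usual NTP naming):

```python
# The NTP on-wire calculation: t1 = client send, t2 = server receive,
# t3 = server send, t4 = client receive.  The offset formula is exact
# only if the outbound and return delays are equal; any asymmetry shows
# up directly as apparent clock error.

def ntp_offset_delay(t1, t2, t3, t4):
    offset = ((t2 - t1) + (t3 - t4)) / 2   # estimated clock offset
    delay = (t4 - t1) - (t3 - t2)          # round-trip network delay
    return offset, delay

# Example: server clock 0.5 s ahead, 40 ms each way (symmetric):
offset, delay = ntp_offset_delay(10.000, 10.540, 10.541, 10.081)
print(offset, delay)   # ~0.5 s offset, ~80 ms round-trip delay
```

Shift all the delay onto one leg of that example and the computed offset moves by half the round-trip time, which is exactly the wedge in the plot linked above.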
> 
> ----------
> 
> Crystals are typically off 10 to 100 ppm.  Your clock gets a lot better after 
> you correct for that.  Let's do the math.  100K seconds per day means 10 ppm 
> is 1 second per day.  So 100 ppm error would be 10 seconds per day.

Yeah, I'm mostly familiar with this stuff: astronomers care a lot about
time: it makes it much easier to find things in the sky ;-).  I even
know they make ovens for crystals, to improve their behavior a lot, if
you spend more money.
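In exact numbers (a day is 86,400 seconds, not quite the round 100K used above):

```python
# Crystal drift in seconds per day for a given error in parts per million.

SECONDS_PER_DAY = 86_400

def drift_per_day(ppm):
    return ppm * SECONDS_PER_DAY / 1e6

print(drift_per_day(10))    # ~0.86 s/day, close to the 1 s/day ballpark
print(drift_per_day(100))   # ~8.6 s/day
print(drift_per_day(500))   # ntpd's 500 ppm ceiling: ~43 s/day
```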

> 
> ntpd calls that "drift" and stores the answer in the file system, typically 
> /etc/ntp/drift (RH 7) or /var/lib/ntp/drift (FC 4) or /var/db/ntpd.drift 
> (FreeBSD) or ...
> 
> NTP has a limit of 500 ppm.  (It's tangled with the math.  NTP is a phase 
> lock loop.  There are stability considerations.)  Sometimes the software 
> screws up this area.
> 
> There is enough temperature dependence in crystals so that you can easily see 
> the correlation if you plot drift and temperature over time.  Ballpark is 1 
> ppm per C.
> 
> Temperature of the crystal is what matters.  It usually tracks the CPU 
> loading.  (I
> can easily tell when I start reading my mail and/or when long cron jobs go 
> off.)
> 
> You can also get better accuracy by going the other direction - measure 
> temperature and feed that into the time calculations.  Graph here:
>   http://www.ijs.si/time/#temp-dependency
> Good writeup of how to do the temperature corrections:
>   http://www.ijs.si/time/temp-compensation/
> (Fun read, time sink warning, probably not appropriate for your toys.)
> 
> ----------
> 
> There are a couple of quirks that I don't know much about.
> 
> NTP doesn't work very well over dialup links.  The problem is how to 
> coordinate getting time when the line is up without running up a lot of phone 
> charges just to keep the time accurate when the system is otherwise idle 
> and/or getting confused by the delay to setup the connection.
> 
> There is a burst mode.  Rather than a single packet, it uses a burst of 8.  
> One use is to get around setting up transient connections and loading ARP 
> caches.  Another is to try to dance around long queues on the backbone 
> routers.
> 
> There is a broadcast/multicast mode.  The general idea is to exchange a few 
> packets between client and server at startup to determine the round trip 
> time, and then the client just listens, fudging the answer by the offset 
> previously computed.  This may be appropriate for your usage patterns.  Is 
> your mesh going to support broadcast/multicast?

Yes, I believe so.

> 
> The current code does DNS lookup on host names once at startup and expects to 
> run forever.  That screws up if the DNS is broken at boot time or one of the 
> target servers changes IP address.  People are talking about fixing this.  I 
> haven't been paying attention to the details.  This gets tangled up with 
> various DNS hacks like the pool project.
> 
> ----------
> 
> Do you know how you are going to implement time on each machine?

Nope.

> 
> Plan A is to use the CPU cycle counter.  I haven't looked carefully, but that 
> has troubles with SMP.  (Obviously not a problem for you.)

Can't.  We like to turn off the CPU a lot.

> 
> Plan B is to use the scheduler interrupts, probably driven by a 32 KHz 
> crystal on the TOY clock.  The disadvantage of this is that the tick size is 
> big, typically 10 ms.  You can mostly fix that by interpolating with the CPU 
> counter.  That gives a tick size in the sub-microsecond range.  There is a 
> version of syscall to get the time that returns nanoseconds rather than 
> microseconds.
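The nanosecond-resolution call being alluded to is clock_gettime(), versus the older microsecond gettimeofday(); Python exposes both, for example (CLOCK_REALTIME is Unix-only):

```python
import time

# gettimeofday()-style: float seconds, microsecond-ish resolution
coarse = time.time()

# clock_gettime()-style: integer nanoseconds since the epoch
fine = time.clock_gettime_ns(time.CLOCK_REALTIME)

print(coarse, fine)
```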
> 
> I think the next step is to figure out how good a time you want or need.  Is 
> within a second good enough?  Make gets into trouble if client and NFS server 
> are off by that much.
> 
> The next parameter is how often do you reboot?  Are you going to have a 
> battery backed TOY clock?  That seems like a good target for cost reduction.  
> If not, how many times per day are you going to power up (aka reboot)?

Dunno for absolute sure.  I haven't had time to explore the schematics
in detail.

Mark?

> 
> Are you going to turn the CPU really off (turn off the oscillator and PLLs) 
> when idle?  It takes a few ms to get going again.  That's probably too much 
> work for not enough power savings.  (I'm assuming that RAM refresh will be a 
> good lower limit.  Are you going to do sneaky things like copy stuff to 
> Flash, make pages write protected... so you can get rid of refresh?)

Yes, and it takes the stupid Geode 150 milliseconds to get going again.

I don't think we'll usually suspend to disk (flash).

RAM will go into self-refresh.
> 
> Assuming you have a TOY clock, it should be good for (much) better than a 
> second per day.  If a second per day is all you need, you should be able to 
> get by with a packet exchange once per day.  (Figuring out when to do that 
> may be more expensive than doing it more often.)
> 
> The low tech approach is to run a cron job occasionally and just smash the 
> time to the correct value.  That's easy to understand, but much less good 
> (overall) than keeping track of the error/drift.  It's probably what I'd 
> start with.
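The cron-job approach boils down to a single SNTP exchange.  A minimal sketch (the on-wire format is just a 48-byte packet; sntp_query() does the network part, shown here but not run):

```python
import socket
import struct

# A minimal SNTP exchange, the moral equivalent of ntpdate in a cron
# job: send one mode-3 (client) packet and read back the server's
# transmit timestamp.  NTP timestamps count seconds since 1900-01-01;
# the Unix epoch starts 2,208,988,800 s later.

NTP_EPOCH_DELTA = 2_208_988_800

def build_request():
    # First byte: LI = 0, VN = 4, Mode = 3 (client); rest of packet zero.
    return struct.pack("!B47x", (4 << 3) | 3)

def parse_transmit_time(packet):
    # Transmit timestamp: 32-bit seconds + 32-bit fraction at offset 40.
    secs, frac = struct.unpack("!II", packet[40:48])
    return secs - NTP_EPOCH_DELTA + frac / 2**32

def sntp_query(server, port=123, timeout=5.0):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(build_request(), (server, port))
        packet, _ = s.recvfrom(48)
    return parse_transmit_time(packet)

# e.g. sntp_query("pool.ntp.org") returns the server's idea of Unix time
```

ntpdate and the sntp program do essentially this, plus sanity checks, before smashing the clock with settimeofday().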
> 
> Assuming you want something better than that...
> 
> One plan is to run a ntpd client/server on each node, adding a stratum for 
> each hop.  That's probably not the best way, but I'm sure it can be made to 
> work.  My not-well-informed opinion assumes your mesh topology will change 
> reasonably often, which would change the NTP topology and tree structure.

Yup.

> 
> NTP packets only have room for 16 hops.  How deep is your mesh going to be?

Not that big: realistically, Michail believes 4-5 hops max.

> 
> The other approach is for each base station to run an NTP server and for the 
> nodes of the local mesh to all use it as their (primary) server.  Are you 
> going to use DHCP?  Does the answer come from the base station?  If so, that 
> seems like a reasonable fit, assuming somebody can teach NTP to use DHCP.

Probably advertise the base station NTP server via mDNS.  See
www.avahi.org.

> 
> That brings up the next question.  What sort of connectivity are you going to 
> have from a base station (if that is even a reasonable concept) to the rest 
> of the world?  What's the ratio of end/mesh nodes to base stations?  Is it 
> reasonable to manually configure base stations?

Depends.

Anything from a wet string, or IP over avian carriers, to fiber, and
everything in between I expect.

Well, there are 30,000 schools in Thailand.  Manual configuration seems
like a way to ask for a Netgear or D-Link disaster.


> 
> You might be able to piggyback time on your mesh network routing stuff.  
> Probably not worth the effort unless you need good time.  (Audio, video, 
> realtime stuff.)
> 
> ----------
> 
> What about the base stations?  I'm assuming they want somewhat better time.
> 

I'd hope so.  I'm presuming you mean by base station the server machines
we expect to put into the schools for content and software distribution,
among other services.

> Are they going to be highly resource constrained environments?  Will they be 
> running 24/7?  Will they have real power?  Disks?

We hope so, but reality says 24/7 may not happen, due to flaky power.
We expect them to have disks.
> 
> What's the ratio of mesh nodes to base stations?
> 

I expect it will vary all over the map.

Thanks for the braindump.
                                   Regards,
                                             - Jim


-- 
Jim Gettys
One Laptop Per Child



