As we've been rolling out Pilot Program units over the past few weeks, we began to encounter some puzzling and difficult-to-reproduce errors with the P2P network. These typically showed up as broken pages, remote nodes disconnecting without warning, and even certain sites that would stop working across the entire network simultaneously.

The errors were serious enough that we deemed it necessary to divert our engineering resources to figuring out what was going on. After an epic 10-day sprint, we finally got to the bottom of it.

Most of the issues were caused by errant IoT devices sending requests to non-existent sites. These requests would inevitably fail, and the resulting errors were mistakenly blamed on remote peers. That, in turn, caused random disconnects and the page load errors that followed.

The remaining issues were much more difficult to diagnose, as they were almost entirely related to subtle race conditions in the P2P code. These are among the most complex issues one tends to encounter in networking apps. However, the ever-so-useful "go test -race" command came to the rescue and helped us diagnose over a dozen race conditions. They were only showing up at random times and places on the P2P network, and they caused nodes to lock up in sync with each other.

The great news is that we've now begun pushing this epic update to all Winston devices. If you've been experiencing any problems with random broken pages, this should resolve them.