Network Protocols and Vectorization

Doing things in parallel is one of the older performance tricks.  Vector SIMD machines -- like the Cray supercomputers -- attack problems that benefit from doing the same thing to lots of different pieces of data simultaneously.  It's just a performance trick, but it drove the design and even the physical shape of those machines because the problems they're trying to tackle -- airflow simulation, weather prediction, nuclear explosion simulation, etc. -- are both important and difficult to scale up.  (More recently, we're seeing massively parallel machines built out of individual commodity PCs; conceptually the same, but limited mostly by network latency/bandwidth.)

So what does this have to do with network protocols?  Just as the problems of doing things like a matrix-vector multiply very, very fast drove the designs of supercomputers, the problems of moving data from one place to another very quickly, on demand, drive the designs of today's network services.  The designs of network APIs (whether REST, SOAP, XML-RPC, or whatever) need to take these demands into account.
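(To make the analogy concrete, here's a toy sketch of the kind of vectorization those machines were built for -- the same multiply-add applied across lots of data at once.  numpy is just standing in for the hardware vector units, and the sizes are arbitrary.)

```python
import numpy as np

A = np.random.rand(200, 200)
x = np.random.rand(200)

# Scalar-style: one multiply-add at a time.
y_loop = np.zeros(200)
for i in range(200):
    for j in range(200):
        y_loop[i] += A[i, j] * x[j]

# Vectorized: the whole matrix-vector multiply expressed at once, so the
# library/hardware can do many multiply-adds in parallel.
y_vec = A @ x

assert np.allclose(y_loop, y_vec)
```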

In particular, transferring lots of small pieces of data in serial fashion over a network can be a big problem.  Lots of protocols that are perfectly fine when run locally or over a LAN fail miserably when expected to deal with 100-200ms latencies on a WAN or the Internet.  HTTP does a decent job of balancing out performance/latency issues for retrieving human-readable pages -- a page comes down as a medium-sized chunk of data, followed by, if necessary, associated resources such as scripts, style sheets, and binary images, which can all be retrieved in parallel/behind the scenes.  Note that this is achieved only through lots of work on the client side and deep knowledge of the interactions between HTML, HTTP, and the final UI.  The tradeoff is complexity of protocol and implementation.
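Roughly, the client-side pattern looks like this -- fetch the page, then fetch its associated resources in parallel.  This is only an illustrative sketch: the URLs are placeholders, and a real client would discover the resources by parsing the HTML rather than using a hard-coded list.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    with urlopen(url) as resp:
        return resp.read()

# The page itself comes down first, as one medium-sized chunk.
page = fetch("https://example.com/")

# Placeholder resources; a real client would find these in the HTML.
resources = [
    "https://example.com/style.css",
    "https://example.com/app.js",
    "https://example.com/logo.png",
]

# Then the associated resources are fetched in parallel, behind the scenes.
with ThreadPoolExecutor(max_workers=4) as pool:
    bodies = list(pool.map(fetch, resources))
```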

How does this apply to network protocols in general?  One idea is to carefully scrutinize protocol requests that transfer a single small piece of data.  Often a single small piece of data isn't very useful on its own.  Are there common use cases where a system will do this in a loop, perhaps serially, to get enough data to process or present to a user?  If so, perhaps it would be a good idea to think of "vectorizing" that part of the protocol.  Instead of returning a single piece of data, for example, return a variable-length collection of those pieces of data.  The semantics of the request may change only slightly -- from "I return an X" to "I return a set of X".  Ideally, the length should be dynamic and the client should be able to ask for "no more than N" on each request.
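Here's a minimal sketch of that change in interface, using a fake in-memory store to stand in for the remote service.  The method names and the max_results cap are made up; the point is just the shape of the request going from "give me one X" to "give me up to N of them".

```python
class FakeStore:
    """Stand-in for a remote service; counts simulated round trips."""
    def __init__(self, data):
        self.data = data          # item_id -> value
        self.round_trips = 0

    # Before: one round trip per item.
    def get_item(self, item_id):
        self.round_trips += 1
        return self.data[item_id]

    # After: one round trip for up to max_results items; the client
    # chooses the cap, and the server may return fewer.
    def get_items(self, item_ids, max_results=100):
        self.round_trips += 1
        ids = list(item_ids)[:max_results]
        return {i: self.data[i] for i in ids}

store = FakeStore({i: f"value-{i}" for i in range(50)})

serial = [store.get_item(i) for i in range(50)]        # 50 round trips
batched = store.get_items(range(50), max_results=50)   # 1 round trip
print(store.round_trips)  # 51 total: 50 serial + 1 batched
```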

For example, imagine a protocol that requires a client to first retrieve a set of handles (say, mailboxes for a user) then query each one in turn to get some data (say, the number of unread messages).  If this is something that happens often -- for example, automatically every two minutes -- there are going to be a lot of packets hitting servers.  If multiple mailboxes are on one server, it would be fairly trivial to vectorize the second call and effectively combine the two queries into one -- call it "get mailbox state(s)".  This would let a client retrieve the state for all mailboxes on a given server, with better latency and far less bandwidth than the first option.  Of course there's no free lunch; if a client is dealing with multiple servers, it now has to group the mailboxes for each server for purposes of retrieving state.  But conceptually, it's not too huge of a leap.
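A sketch of the client side of that, with hypothetical names throughout -- get_mailbox_states plays the role of the vectorized "get mailbox state(s)" call, and the grouping step is the extra work the client takes on once mailboxes can live on different servers:

```python
from collections import defaultdict

class FakeMailServer:
    """Stand-in for a remote server holding some mailboxes."""
    def __init__(self, unread_by_mailbox):
        self.unread = unread_by_mailbox

    def get_mailbox_states(self, mailbox_ids):
        # One round trip returns state for every requested mailbox.
        return {m: {"unread": self.unread[m]} for m in mailbox_ids}

def poll_unread(mailboxes):
    """mailboxes: iterable of (server, mailbox_id) pairs."""
    by_server = defaultdict(list)
    for server, mailbox_id in mailboxes:
        by_server[server].append(mailbox_id)

    unread = {}
    for server, ids in by_server.items():
        # One vectorized request per server instead of one per mailbox.
        for mailbox_id, state in server.get_mailbox_states(ids).items():
            unread[mailbox_id] = state["unread"]
    return unread

a = FakeMailServer({"inbox": 3, "lists": 12})
b = FakeMailServer({"work": 0})
print(poll_unread([(a, "inbox"), (a, "lists"), (b, "work")]))
```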

There are other trade-offs.  If the "extra" data is large -- like a binary image -- it might well be better to download it separately, perhaps in parallel with other things.  If it's cacheable, but the main data isn't, it may again be better to separate it out so you can take advantage of things like HTTP caching.

To summarize, one might want to vectorize part of a network protocol if:
  • Performance is important, and network latency is high and/or variable;
  • The data to be vectorized are always or often needed together in common use cases;
  • It doesn't overcomplicate the protocol;
  • There's no other way to achieve similar performance (parallel requests, caching, etc.).
Of course, this applies to the Atom API.  There's a fair amount of vectorization in the Atom API from the start, since it's designed to deal with feeds as collections of entries.  I think there's a strong use case for being able to deal with collections of feeds as part of the Atom API as well, for all the reasons given above.  Said collections of feeds might be feeds I publish (so I want to know about things like recent comments...) or perhaps feeds I'm tracking (so I want to be able to quickly determine which feeds have something interesting, before downloading all of the most recent data).  It would be interesting to model this information as a synthetic feed, since of course that's already nicely vectorized.  But there are plenty of other ways to achieve the same result.
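For instance (and this is just one possible shape, with invented data and IDs -- only the Atom element names are real), a synthetic feed could carry one entry per tracked feed, summarizing its state:

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)

# Invented per-feed state for the feeds being tracked.
tracked = [
    {"id": "tag:example.org,2005:feed/1", "title": "Feed One", "new_entries": 3},
    {"id": "tag:example.org,2005:feed/2", "title": "Feed Two", "new_entries": 0},
]

feed = ET.Element(f"{{{ATOM}}}feed")
ET.SubElement(feed, f"{{{ATOM}}}title").text = "States of feeds I'm tracking"
ET.SubElement(feed, f"{{{ATOM}}}updated").text = "2005-01-01T00:00:00Z"

for f in tracked:
    entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = f["id"]
    ET.SubElement(entry, f"{{{ATOM}}}title").text = f["title"]
    ET.SubElement(entry, f"{{{ATOM}}}summary").text = f"{f['new_entries']} new entries"

print(ET.tostring(feed, encoding="unicode"))
```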
