Abstractioneer by John Panzer: 2004

2004/12/15

Spam spam spam spam...

Back in June, apparently, the FTC said that a do-not-email list (like the do-not-call list) would not work, and would generate more spam because spammers would use it as a source of new email addresses. Though it's a bit late now, I have to wonder about the latter point. Why not simply map each address into its MD5 checksum before storing it?

So foo@example.com would become "a0b6e8fd2367f5999b6b4e7e1ce9e2d2", which is useless for sending email. However, spammers could use any of many available tools to check for "hits" on their email lists, so it's still perfectly usable for filtering out email addresses. Of course it would also tell spammers that they have a 'real' email address on their list, but only if they already had it -- so I don't think that would be giving them much information at all.

I still think the list would be useless because spammers would simply ignore it. But it wouldn't generate new spam, and it would drive up the cost of spamming by making the threat of legal action a bit more possible.
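
A minimal sketch of the hashed-list idea, assuming addresses are trimmed and lowercased before hashing (the normalization step and the function names are illustrative assumptions, not part of any real FTC scheme):

```python
import hashlib


def address_digest(address: str) -> str:
    """Return the hex MD5 digest of a normalized email address."""
    # Assumption: trim whitespace and lowercase so trivially different
    # spellings of the same address map to the same digest.
    normalized = address.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def build_registry(addresses):
    """Store only digests, never the addresses themselves."""
    return {address_digest(a) for a in addresses}


def scrub(mailing_list, registry):
    """Drop every address whose digest appears in the registry."""
    return [a for a in mailing_list if address_digest(a) not in registry]


registry = build_registry(["foo@example.com", "Bar@Example.com "])
remaining = scrub(
    ["foo@example.com", "bar@example.com", "baz@example.com"], registry
)
print(remaining)
```

A sender can only test addresses it already holds against the registry; the registry itself contains nothing that can be mailed to.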

2004/12/14

The Noosphere Just Got Closer

Of course it'll take several years, but Google's just-announced project to digitize major university library collections means that the print-only "dark matter" of the noosphere is about to be mapped out and made available to anyone with an Internet connection. Well, at least the parts that have passed into the public domain; the rest will be indexed.

I'm clearly a geek -- my toes are tingling.

2004/12/13

The "5th Estate"

Interesting quote, from my point of view, in this article:

Jonathan Miller, Head of AOL in the US, testifies to the popularity of Citizen's Media. He says that 60 - 70 per cent of the time people spend on AOL is devoted to 'audience generated content'.

(Though he's talking mostly about things like message boards and chat rooms, of course, rather than blogs.)

2004/12/06

Welcome MSN Spaces!

A surprise to welcome me back from sabbatical: Microsoft released the beta of MSN Spaces (congratulations guys!). I've been playing with it a bit over the past few days; there's some very cool stuff there, especially the integrations between Microsoft applications.

(I've seen a few comments about the instability of the Spaces service; come on folks, it's a beta. And they're turning around bug fixes in 48 hours while keeping up with what has got to be a ton of traffic.)

2004/11/24

The Atom Publishing Protocol Summarized

The slides from Joe Gregorio's XML 2004 talk about the Atom Publishing Protocol are online. It's an excellent summary, and makes a good case for the document literal and addressable web resource approaches. The publishing protocol is where Atom really starts to get exciting.

2004/11/23

Software Patents Considered Harmful

This post by Paul Vick is, I think, a very honest and representative take on software patents -- and in particular the over-the-top IsNot patent -- from the point of view of an innovator. I find myself agreeing with him wholeheartedly:

Microsoft has been as much a victim of this as anyone else, and yet we're right there in there with everyone else, playing the game. It's become a Mexican standoff, and there's no good way out at the moment short of a broad consensus to end the game at the legislative level.

And we all know how Mexican standoffs typically end. Paul, my name is on a couple of patents which I'm not proud of either. But in the current environment, there really isn't a choice: We're all locked in to locally 'least bad' courses, which together work to guarantee the continuation of the downward spiral (and in the long run, make all companies worse off -- other than Nathan Myhrvold's, of course.)

2004/11/22

Web Services and KISS

Adam Bosworth argues for the 'worse is better' philosophy of web services eloquently in his ISCOC talk and blog entry. I have a lot of sympathy for this point of view. I'm also skeptical about the benefits of the WS-* paradigm. They seem to me to be well designed to sell development tools and enterprise consulting services.

2004/11/14

Why Aggregation Matters

Sometimes, I feel like I'm banging my head against a wall trying to describe just why feed syndication and aggregation is important. In an earlier post, I tried to expand the universe of discourse by throwing out as many possible uses as I could dream up. Joshua Porter has written a really good article about why aggregation is a big deal, even just considering its impact on web site design: Home Alone? How Content Aggregators Change Navigation and Control of Content.

2004/11/01

Prediction is Difficult, Especially the Future

My second hat at AOL is development manager for the AOL Polls system. This means I've had the pleasure of watching the conventions and debates in real time while sitting on conference calls watching the performance of our instant polling systems. Which had some potential issues, but which, after a lot of work, seem to be just fine now.

Anyway: The interesting thing about the instant polling during the debates was how different the results were from the conventional instant phone polls. For example, after the final debate the AOL Instapoll respondents gave the debate win to Kerry by something like 60% to 40%. The ABC News poll was more like 50%/50%. Frankly, I don't believe any of these polls. However, I'll throw this thought out: The online instapolls are taken by a self-selected group of people who are interested in the election and care about making their opinions known. Hmmm... much like the polls being conducted tomorrow.

I'll go out on a limb and make a prediction based on the various poll results and on a lot of guesswork: Kerry will win the popular vote by a significant margin. And, he'll win at least half of the "battleground" states by a margin larger than the last polls show. But, I make no predictions about what hijinks might ensue in the Electoral College.

Update 11/11: Well, maybe not...

2004/10/18

Random Note: DNA's Dark Matter

Scientific American's The Hidden Genetic Program of Complex Organisms grabbed my attention last week. This could be the biological equivalent of the discovery of dark matter. Basically, the 'junk' or intron DNA that forms a majority of our genome may not be junk at all, but rather control code that regulates the expression of other genes.

The programming analogy would be, I think, that the protein-coding parts of the genome would be the firmware or opcodes while the control DNA is the source code that controls when and how the opcodes are executed. Aside from the sheer coolness of understanding how life actually works, there's a huge potential here for doing useful genetic manipulation. It's got to be easier to tweak control code than to try to edit firmware... (Free link on same subject: The Unseen Genome.)

2004/10/11

Things in Need of a Feed

Syndicated feeds are much bigger than blogs and news stories; they're a platform. A bunch of use cases, several of which actually exist in some form, others just things I'd like to see:

Blog entries for blogs I'm interested in

Feed of all comments on entries I've authored

News stories matching a custom filter I've set up

Traffic conditions on my customary route(s)

Fedex shipping feed giving status and history for all of my packages

Customer support feed giving status and history for all my issues (any company)

Product safety/recall information for everything I buy

Amazon feed of new books matching my preferences

All new material by a specific author (on any blog or online source)

Feed of new feeds, of various types:

Just my friends

Authored by people whose blogs I already subscribe to

Filtered on personal profile/interests

House for sale listings
Newly discovered prime numbers (okay, a niche audience)
Airport flight status alerts
Movies in my Netflix queue and recommendations
Audio / video content pushed onto my iPod (Podcasting)
Auction information

Addendum 11/11:

Multiplayer game results feed
New government publications feed
New computer virus alerts feed (with metadata giving virus signatures)
Book queue

2004/10/05

Niche Markets

Niche markets are where it's at: Chris Anderson's The Long Tail is exactly right. The Internet not only eliminates the overhead of physical space but also, more importantly, reduces the overhead of finding what you want to near-zero. When your computer tracks your preferences and auto-discovers new content that you actually want, it enables new markets that couldn't otherwise exist.

Update 10/11: Joi Ito's take.

2004/08/01

Network Protocols and Vectorization

Doing things in parallel is one of the older performance tricks. Vector SIMD machines -- like the Cray supercomputers -- attack problems that benefit from doing the same thing to lots of different pieces of data simultaneously. It's just a performance trick, but it drove the design and even the physical shape of those machines because the problems they're trying to tackle -- airflow simulation, weather prediction, nuclear explosion simulation, etc. -- are both important and difficult to scale up. (More recently, we're seeing massively parallel machines built out of individual commodity PCs; conceptually the same, but limited mostly by network latency/bandwidth.)

So what does this have to do with network protocols? Just as the problems of doing things like a matrix-vector multiply very, very fast drove the designs of supercomputers, the problems of moving data from one place to another very quickly, on demand, drive the designs of today's network services. The designs of network APIs (whether REST, SOAP, XML-RPC, or whatever) need to take these demands into account.

In particular, transferring lots of small pieces of data in serial fashion over a network can be a big problem. Lots of protocols that are perfectly fine when run locally or over a LAN fail miserably when expected to deal with 100-200ms latencies on a WAN or the Internet. HTTP does a decent job of balancing out performance/latency issues for retrieving human readable pages -- a page comes down as a medium-sized chunk of data, followed by, if necessary, associated resources such as scripts, style sheets, and binary images, which can all be retrieved in parallel/behind the scenes. Note that this is achieved only through lots of work on the client side and deep knowledge of the interactions between HTML, HTTP, and the final UI. The tradeoff is complexity of protocol and implementation.

How does this apply to network protocols in general? One idea is to carefully scrutinize protocol requests that transfer a single small piece of data. Often a single small piece of data isn't very useful on its own. Are there common use cases where a system will do this in a loop, perhaps serially, to get enough data to process or present to a user? If so, perhaps it would be a good idea to think of "vectorizing" that part of the protocol. Instead of returning a single piece of data, for example, return a variable-length collection of those pieces of data. The semantics of the request may change only slightly -- from "I return an X" to "I return a set of X". Ideally, the length should be dynamic and the client should be able to ask for "no more than N" on each request.

For example, imagine a protocol that requires a client to first retrieve a set of handles (say, mailboxes for a user) then query each one in turn to get some data (say, the number of unread messages). If this is something that happens often -- for example, automatically every two minutes -- there are going to be a lot of packets hitting servers. If multiple mailboxes are on one server, it would be fairly trivial to vectorize the second call and effectively combine the two queries into one -- call it "get mailbox state(s)". This would let a client retrieve the state for all mailboxes on a given server, with better latency and far less bandwidth than the first option. Of course there's no free lunch; if a client is dealing with multiple servers, it now has to group the mailboxes for each server for purposes of retrieving state. But conceptually, it's not too huge of a leap.
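
The mailbox example can be sketched with a toy in-memory server that counts round trips. Everything here (the `MailServer` class, its method names, the `limit` parameter) is a hypothetical illustration rather than any real mail protocol:

```python
from dataclasses import dataclass


@dataclass
class MailServer:
    """Toy stand-in for one mail server; each method call is one round trip."""

    unread: dict  # mailbox name -> number of unread messages
    round_trips: int = 0

    def list_mailboxes(self):
        self.round_trips += 1
        return list(self.unread)

    def get_mailbox_state(self, mailbox):
        """Scalar call: "I return an X"."""
        self.round_trips += 1
        return self.unread[mailbox]

    def get_mailbox_states(self, mailboxes=None, limit=None):
        """Vectorized call: "I return a set of X", at most `limit` of them."""
        self.round_trips += 1
        names = list(self.unread) if mailboxes is None else list(mailboxes)
        if limit is not None:
            names = names[:limit]
        return {name: self.unread[name] for name in names}


# Serial style: one request for the handles, then one per mailbox.
serial = MailServer({"inbox": 3, "lists": 12, "spam": 40})
serial_state = {m: serial.get_mailbox_state(m) for m in serial.list_mailboxes()}

# Vectorized style: a single "get mailbox state(s)" request.
batched = MailServer({"inbox": 3, "lists": 12, "spam": 40})
batched_state = batched.get_mailbox_states()

print(serial.round_trips, batched.round_trips)
```

Both clients end up with the same data; the difference is the number of round trips, which is what hurts at 100-200ms each.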

There are other trade-offs. If the "extra" data is large -- like a binary image -- it might well be better to download it separately, perhaps in parallel with other things. If it's cacheable, but the main data isn't, it may again be better to separate it out so you can take advantage of things like HTTP caching.

To summarize, one might want to vectorize part of a network protocol if:

Performance is important, and network latency is high and/or variable;
The data to be vectorized are always or often needed together in common use cases;
It doesn't over-complexify the protocol;
There's no other way to achieve similar performance (parallel requests, caching, etc.)

Of course, this applies to the Atom API. There's a fair amount of vectorization in the Atom API from the start, since it's designed to deal with feeds as collections of entries. I think there's a strong use case for being able to deal with collections of feeds as part of the Atom API as well, for all the reasons given above. Said collections of feeds might be feeds I publish (so I want to know about things like recent comments...) or perhaps feeds I'm tracking (so I want to be able to quickly determine which feeds have something interesting, before downloading all of the most recent data). It would be interesting to model this information as a synthetic feed, since of course that's already nicely vectorized. But there are plenty of other ways to achieve the same result.

2004/07/04

Office Space

How important is the physical workspace to knowledge workers generally, and software developers specifically? Everybody agrees it's important. Talk to ten people, though, and you'll get nine different opinions about what aspects are important and how much they impact effectiveness. But there are some classic studies that shed some light on the subject; looking around recently, they haven't been refuted. At the same time, a lot of people in the software industry don't seem to have heard of them.

Take the amount and kind of workspace provided to each knowledge worker. You can quantify this (number of square feet, open/cubicle/office options). What effects should you expect from, say, changing the number of square feet per person from 80 to 64? What would this do to your current project's effort and schedule?

There's no plug-in formula for this, but based on the available data, I'd guesstimate that the effort would expand by up to 30%. Why?

"Programmer Performance and the Effects of the Workplace" describes the Coding War Games, a competition in which hundreds of developers from dozens of companies compete on identical projects. (Also described in Peopleware: Productive Projects and Teams.) The data is from the 1980's, but hasn't been replicated since as far as I can tell. The developers were ranked according to how quickly they completed the projects, into top 25%, middle 50%, and bottom 25%. The competition work was done in their normal office environments.

The top 25% had an average of 78 square feet of dedicated office space.
The bottom 25% had an average of 46 square feet of dedicated office space.
The top 25% finished 2.6 times faster, on average, than the bottom 25%, with a lower defect rate.
They ruled out the idea that top performers tended to be rewarded with larger offices.

Now, whether larger workspaces improve productivity, or whether more productive people tend to gravitate to companies with larger workspaces, doesn't really matter to me as a manager. Either way, the answer is the same: Moving from 46 square feet per person to 78 square feet per person can reduce the time to complete a project by a factor of up to 2.6x. That's big. (Of course there were other differences between the environment of the top 25% and the bottom 25%, but they are largely related to issues like noise, interruptions, and privacy. It seems reasonable to assume these are correlated with people density.)

In itself, this doesn't give us an answer for the question we started out with (changing from 80 square feet to 64 square feet per person, and bumping up the people density commensurately). As a first approximation, let's assume a linear relationship between dedicated area per person and productivity ratios. 64 is just over halfway between 46 and 78, so it seems reasonable to use half of the 2.6 factor, or 1.3, as a guesstimate. So using this number, a project that was going to take two weeks in the old environment would take 1.3 times as long, or around two and a half weeks, in the new environment. (In the long term, of course.)
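
The back-of-the-envelope arithmetic above can be spelled out as a small sketch. It only reproduces the rough guesstimate in the text (halving the 2.6 factor); it is not a fitted model, and the function name is made up for illustration:

```python
# Figures quoted above from the Coding War Games data.
BOTTOM_SQFT = 46.0
TOP_SQFT = 78.0
TOP_VS_BOTTOM_FACTOR = 2.6


def fraction_of_range(value, low, high):
    """Where `value` sits between `low` and `high`, as a fraction from 0 to 1."""
    return (value - low) / (high - low)


position = fraction_of_range(64.0, BOTTOM_SQFT, TOP_SQFT)  # 18 / 32
slowdown_guess = TOP_VS_BOTTOM_FACTOR / 2  # the rough "half of 2.6" step
new_duration_weeks = 2.0 * slowdown_guess  # a two-week project, slowed down

print(f"64 sq ft sits {position:.0%} of the way from 46 to 78")
print(f"guesstimated slowdown: {slowdown_guess:.1f}x")
print(f"a 2-week project becomes about {new_duration_weeks:.1f} weeks")
```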

To put this into perspective, it appears that increasing an organization's CMM level by one generally results in an 11% increase in productivity, and that the ratio of effort between worst and best real-world processes appears to be no more than 1.43.

You can't follow the numbers blindly here. This probably depends a lot on the kind of work you actually do, and I can think of dozens of caveats. My gut feeling is that the penalty is likely to be more like 10% than 30%, assuming you're really holding everything else (noise, interruptions, etc.) as constant as possible. I suspect that the organizations which are squeezing people into ice cube sized cubicles are likely to be destroying productivity in other ways as well. But, these numbers do provide some guidance as to what to expect in terms of costs and consequences of changing the workplace environment.

Links and references:

In How office space affects programming productivity (IEEE Computer Vol. 28 No. 1; Jan 1995, pp. 7676) Capers Jones gives a guideline of at least 80 square feet of space per person, with full walls and doors, for optimal productivity.
The most well-documented planning exercise for knowledge worker facilities is IBM's Santa Teresa facility; a discussion is here.
Steve McConnell gives a good overview of this and other issues in Quantifying Soft Factors (IEEE Software Vol. 17 No. 6: Nov/Dec 2000, pp. 9-11).
T. DeMarco and T. Lister, "Programmer Performance and the Effects of the Workplace", Proc. 8th Int'l Conf. Software Eng., ACM Press, New York, 1985, pp. 268-272.
A great anecdote: Joel Spolsky, Bionic Office. He's betting a lot of money that it's effective to equip his company with spacious, private offices.

2004/07/01

Community, social networks, and technology at Supernova 2004

Some afterthoughts from the Supernova conference, specifically about social networks and community. Though it's difficult to separate the different topics.

A quick meta-note here: Supernova is itself a social network of people and ideas, specifically about technology -- more akin to a scientific conference than an industry conference. And, it's making a lot of use of various social tools: http://www.socialtext.net/supernova/, http://supernova.typepad.com/moblog/.

Decentralized Work (Thomas Malone) sounds good, but I think there are powerful entrenched stakeholders that can work against or reverse this trend (just because it would be good doesn't mean it will happen). I'm taking a look at The Future of Work right now; one first inchoate thought is how some of the same themes are treated differently in The Innovator's Solution.

The Network is People - a panel with Christopher Allen, Esther Dyson, Ray Ozzie, and Mena Trott. Interesting/new thoughts:

Chris Allen on spreadsheets: They are a social tool for convincing people with numbers and scenarios, just like presentation software is for convincing people with words and images. So if you consider a spreadsheet social software, well, what isn't social software?
"43% of time is spent on grooming in large monkey troops." (But wait, what species of monkeys are we talking about here? Where are our footnotes?) So, the implication is that the amount of overhead involved in maintaining true social ties in large groups is probably very high. Tools that would actually help with this (as opposed to just growing the size of your 'network' to ridiculous proportions) would be a true killer app.
Size of network is not necessarily a good metric, just one that's easy to measure. Some people really only want a small group.

Syndication Nation - panel with Tim Bray, Paul Boutin, Scott Rosenberg, Kevin Marks, Dave Sifry. I felt that this panel had a lot of promise but spent a lot of time on background and/or ratholing on imponderables (like business models). Kevin and Tim tried to open this up a bit to talk about some of the new possibilities that automatic syndication offers. At the moment, it's mostly about news stories and blogs and cat pictures. Some interesting/new thoughts:

Kevin stated that # of subscribers to a given feed follows a power law almost exactly, all the way down to 1. So even having a handful of readers is an accomplishment. One might also note that this means the vast majority of subscriptions are in this 'micropublishing' area.
New syndication possibilities mentioned: Traffic cameras for your favorite/current route.
The Web is like a vast library; syndicated feeds are about what's happening now (stasis vs. change). What does this mean?
The one interesting thing to come out of the how-to-get-paid-for-this discussion: What if you could subscribe to a feed of advertising that you want to see? How much more would advertisers pay for this? (Reminds me of a discussion I heard recently about radio stations going back to actually playing more music and less talk/commercials: They actually get paid more per commercial-minute because advertisers realize their ad won't be buried in a sea of crap that nobody is listening to.)

2004/06/25

Supernova 2004 midterm update

I'm at the Supernova 2004 conference at the moment. I'm scribbling notes as I go, and plan to go back and cohere the highlights into a post-conference writeup. First impressions: Lots of smart and articulate people here, both on the panels and in the 'audience'. I wish there were more time for audience participation, though there is plenty of time for informal interactions between and after sessions. The more panel-like sessions are better than the formal presentations.

The Syndication Nation panel had some good points, but it ratholed a bit on standard issues and would have benefited from a longer term/wider vision. How to pay for content is important, but it's a well trodden area. We could just give it a code name, like a chess opening, and save a lot of discussion time...

I am interested in the Autonomic Computing discussion and related topics, if for no other reason than we really need to be able to focus smart people on something other than how to handle and recover from system issues. It's addressing the technical complexity problem.

Next problem: The legal complexity problem (IP vs. IP: Intellectual Property Meets the Internet Protocol) - I think this problem is far harder because it's political. There's no good solution in sight for how to deal with the disruptions technology is causing business models and the structure of IP law.

And, on a minor note, I learned the correct pronunciation of Esther Dyson's first name.

2004/06/20

Atom Proposal: Simple resource posting

On the Atom front, I've just added a proposal to the Wiki: PaceSimpleResourcePosting. The abstract is:

This proposal extends the AtomAPI to allow for a new creation URI, ResourcePostURI, to be used for simple, efficient uploading of resources referenced by a separate Atom entry. It also extends the Atom format to allow a "src" attribute of the content element to point to an external URI as an alternative to providing the content inline.

This proposal is an alternative to PaceObjectModule, PaceDontSyndicate, and PaceResource. It is almost a subset of and is compatible with PaceNonEntryResources, but differs in that it presents a very focused approach to the specific problem of efficiently uploading the parts of a compound document to form a new Atom entry. This proposal does not conflict with WebDAV but does not require that a server support WebDAV.
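
To make the "src" idea concrete, here is a rough sketch that builds an entry fragment whose content element points at an external resource instead of carrying it inline. It assumes the namespace and attribute names of the later Atom 1.0 format (RFC 4287), which postdates this proposal; the function name and example URI are made up for illustration, and the fragment omits elements a complete entry would need:

```python
import xml.etree.ElementTree as ET

# Assumption: the Atom 1.0 namespace, not whatever the 2004 drafts used.
ATOM_NS = "http://www.w3.org/2005/Atom"


def entry_with_external_content(title, src, media_type):
    """Build an entry fragment whose content lives at `src`, not inline."""
    ET.register_namespace("", ATOM_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    # The content element is empty; "src" says where the bytes can be fetched.
    ET.SubElement(
        entry, f"{{{ATOM_NS}}}content", {"type": media_type, "src": src}
    )
    return entry


entry = entry_with_external_content(
    "My cat", "http://example.org/media/cat.jpg", "image/jpeg"
)
print(ET.tostring(entry, encoding="unicode"))
```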

2004/06/05

Atom: Cat picture use case

To motivate discussion about some of the basic needs for the Atom API, I've documented a use case that I want Atom to support: Posting a Cat Picture. This use case is primarily about simple compound text/picture entries, which I think are going to be very common. It's complicated enough to be interesting but it's still a basic usage.

The basic idea here is that we really want compound documents that contain both text and pictures without users needing to worry about the grungy details; that (X)HTML already offers a way to organize the top level part of this document; and that Atom should at least provide a way to create such entries in a simple way.

2004/06/04

Who am I?

Technorati Profile

I'm currently a tech lead/manager at Google, working on Blogger engineering.

I'm formerly a system architect and technical manager for web based products at AOL. I last managed development for Journals and Favorites Plus. I've helped launch Public & Private Groups, Polls, and Journals for AOL.

History:

Around 1991, before the whole Web thing, I began my career at a startup which intended to compete with Intuit's Quicken software on the then-new Windows 3.0 platform. This was great experience, especially in terms of what not to do[*]. In 1993 I took a semi-break from the software industry to go to graduate school at UC Santa Cruz. About this time Usenet, ftp, and email started to be augmented by the Web. I was primarily interested in machine learning, software engineering, and user interfaces rather than hypertext, though, so I ended up writing a thesis on the use of UI usability analysis in software engineering.

Subsequently, I worked for a startup that essentially attempted to do Flash before the Web really took hold, along with a few other things. We had plugins for Netscape and IE in '97. I played a variety of roles -- API designer, technical documentation manager, information designer, project manager, and development manager. In '98 the company was acquired by CA and I moved shortly thereafter to the combination of AtWeb/Netscape/AOL. (While I was talking to a startup called AtWeb, they were acquired by Netscape and Netscape was in turn acquired by AOL -- an employment trifecta.)

At AtWeb I transitioned to HTML UIs and web servers, working on web and email listserver management software before joining the AOL Community development group. I worked as a principal software engineer and then engineering manager. I've managed the engineering team for the AOL Journals product from its inception in 2003 until the present time; I've also managed the Groups@AOL, Polls, Rostering, and IM Bots projects.

What else have I been doing? I've followed and promoted the C++ standardization process and contributed a tiny amount to the Boost library effort. On a side note, I've taught courses in object oriented programming, C++, Java, and template metaprogramming for UCSC Extension, and published two articles in the C++ Users Journal.

I'm interested in software engineering, process and agile methods, Web standards, language standards, generic programming, information architectures, user interface design, machine learning, evolution, and disruptive innovation.

First Post

The immediate purpose of this blog is to publish thoughts about web technologies, particularly Atom. Of course that suffers from the recursive blogging-about-blogging syndrome, so I'll probably expand it to talk about software in general.

What does the name stand for? Mostly, it stands for "something not currently indexed by Google". Hopefully in a little while it will be the only thing you get when you type "Abstractioneer" into Google. Actually it's a contraction of "Abstract Engineering", which is a meme I'm hoping to propagate. More on that later.