Personal Web Discovery (aka Webfinger)

There's a particular discovery problem for open and distributed protocols such as OpenID, OAuth, Portable Contacts, Activity Streams, and OpenSocial.  It seems like a trivial problem, but it's one of the stumbling blocks that slow mass adoption.  We need to fix it.  So first, I'm going to name it:

The Personal Web Discovery Problem:  Given a person, how do I find out what services that person uses?

This does sound trivial, doesn't it?  And it is easy as long as you're service-centric; if you're building on top of social network X, there is no discovery problem, or at least only a trivial one that can be solved with proprietary APIs.  But what if you want to build on top of X, Y, and Z?  Well, you write code to make the user log in to each one so you can call those proprietary APIs... which means the user has to tell you their identity (and probably password) on each one... and the user has already clicked the Back button because this is complicated and annoying.

This is also the cause of the "NASCAR Effect" that is plaguing OpenID UIs today -- you are faced with a Hobson's choice of making the user figure out what their OpenID is on their favorite provider, or figuring it out for them by making them click on a simple button... on an ever-growing array of buttons to cover all of your top identity providers and your business partners.  So the UI is more complicated than simple username/password.  This is not a recipe for success.

Next, there's the sharing problem -- if I want to share my calendar with someone, how does my software know what calendaring service my friend uses?  Again, if we're both on the same calendar service, then we're fine; otherwise we're in the situation that email was in decades ago, where you had to figure out the bang-path hop to hop address to reach your intended recipient.  (Note that in this case, the service being discovered is for a user who isn't even present.)

Finally, what is a person on the web?  At the moment we can represent a person as a URL (OpenID) or as an email address (most everybody).  A huge adoption issue for OpenID is the lack of a standard for using an email address as an OpenID.  The lack of such a standard is due to email address privacy concerns, and lack of discovery services for email addresses.  The horse has mostly left the barn on email address privacy already, as everyone uses email addresses for logins, and we just need to be careful about not publishing them publicly.  Discovery is now a solved problem, but the news isn't widely distributed yet.

Last week, over bacon and coffee at Social Web Foo Camp, Blaine, Breno, and I realized that all of the pieces are in place to solve these problems, and that they just need to be hooked up the right way, and threw together a last minute session Sunday morning to talk about it.  Here's my take-away:

Personal Web Discovery Puzzle Piece #1: URLs are people, and so are email addresses.

We allow email addresses anywhere an end user would use an OpenID -- from an end user's point of view, they can use an existing email address as an OpenID.  While we're at it, we allow any sufficiently well formed and discoverable string to function as an OpenID, for example Jabber IDs.  This means that a user can use any login ID as an OpenID, and also that if I know someone's email address from their business card, I can share things like my calendar with them (without sending email).  Of course this requires discovery via email addresses to make OpenID work; fortunately that's the second puzzle piece.

Personal Web Discovery Puzzle Piece #2: The new discovery spec is here!

draft-hammer-discovery-03 is hot off the virtual presses this month; section 4.4, The Host Metadata Document, describes the basic piece needed for discovery, but in that spec it's difficult to see how this fits in with puzzle piece #1.  Here's how:  If I provide email addresses at example.com, while redirecting HTTP requests from example.com to www.example.com, I publish a text file at http://www.example.com/host-meta, which contains a line like this one:
Link-Pattern: <http://meta.example.org/?q={%uri}>; 
This means "take the thing you're asking about in URI form -- e.g., mailto:joe@example.com -- stick it in the query parameter to the meta.example.org service, and do a GET on that to retrieve a bunch of metadata about joe@example.com".  The metadata format XRD is itself a simplification of the existing metadata used by OpenID and OAuth today, and it's basically typed links based on URLs.  It maps joe@example.com to the appropriate OpenID provider to be used -- and that itself can be editable, so Joe can choose to use any provider he or she wishes.
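To make the lookup step concrete, here is a minimal sketch in Python of what a client might do with that host-meta line.  It assumes that {%uri} simply means "the percent-encoded subject URI", and the helper names (parse_link_pattern, expand) are mine, not the spec's; a real client would also fetch host-meta over HTTP and handle whatever follows the semicolon.

```python
from urllib.parse import quote


def parse_link_pattern(line):
    """Pull the URI template out of a 'Link-Pattern: <template>; ...' line."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "link-pattern":
        raise ValueError("not a Link-Pattern line")
    start = value.find("<")
    end = value.find(">", start + 1)
    if start == -1 or end == -1:
        raise ValueError("no <template> found")
    return value[start + 1:end]


def expand(template, subject_uri):
    """Substitute the subject URI into the template.

    Assumption: {%uri} stands for the percent-encoded subject URI.
    """
    return template.replace("{%uri}", quote(subject_uri, safe=""))


# The line from the host-meta file above.
line = "Link-Pattern: <http://meta.example.org/?q={%uri}>; "
template = parse_link_pattern(line)
lookup_url = expand(template, "mailto:joe@example.com")
print(lookup_url)
# http://meta.example.org/?q=mailto%3Ajoe%40example.com
```

A GET on the resulting URL is what would return the metadata about joe@example.com.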

So with a bit of swizzling, clients can map from joe@example.com to see if it's usable as an OpenID and if so, where to send the user to log in.  This eliminates the NASCAR effect.  It also means that clients such as web browsers can check to see if the user has a usable OpenID already (it probably has the user's email address from form fill already) and can present a very simple chrome-based "Log in as joe@example.com" on any web site that allows OpenID.  As a nice side effect, we also make the whole system much more phishing-resistant.

But authentication is just one service.  What if I want to provide a way for people to get my public activity stream, for example?  That's almost trivial; just map joe@example.com to the default activity stream, and _that_ stream is a public Activity Stream feed.  I can also link to my blog and its feeds, my photo stream, my calendar, my address book, etc.  It's a user-centric web of services, tied together by a single identifier and discovery.
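Here's a rough sketch, again in Python, of what "typed links" could look like once a client has fetched and parsed Joe's metadata.  Everything in it is made up for illustration: the rel values, the hrefs, and the list-of-dicts shape are my assumptions, not the XRD format itself.

```python
# Hypothetical, already-parsed metadata for joe@example.com.
# Each entry is a typed link: a relation type (rel) plus a target URL (href).
# The rel and href values below are placeholders, not registered types.
LINKS = [
    {"rel": "http://example.org/rel/openid-provider",
     "href": "https://openid.example.net/"},
    {"rel": "http://example.org/rel/activity-stream",
     "href": "http://streams.example.net/joe/public"},
    {"rel": "http://example.org/rel/calendar",
     "href": "https://calendar.example.net/joe"},
]


def find_links(links, rel):
    """Return the targets of every link with the given relation type."""
    return [link["href"] for link in links if link["rel"] == rel]


streams = find_links(LINKS, "http://example.org/rel/activity-stream")
print(streams)
# ['http://streams.example.net/joe/public']
```

The point is only that one identifier fans out to many services, and a client picks out the link types it understands and ignores the rest.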

What about privacy?

None of the basic discovery use cases requires any real authentication or security beyond that provided by HTTP(S).  The services pointed at can of course require authentication -- if I publish a calendar endpoint, that doesn't mean I let just anyone see it; or I may make my free/busy times public but my details may be ACL'd.  The process of discovering that a resource is ACL'd and how to go about authenticating so as to get access is just OAuth (or rather, a usage of the draft-hammer-discovery spec that uses types and endpoints specific to OAuth).  So it's discovery all the way down, and it's possible to mix in as much or as little privacy protection as is needed in each case.  The nice thing is that everybody is already standardizing on OAuth.

Sounds nice, but how does this metadata get created?  Out of thin air?

So we have standards ready to go, and could start writing client libraries today.  But where will all of this metadata come from?  What will motivate identity providers to publish this data, and how can we ensure that they allow users to configure it and not lock them in to the providers' own services?

There are several answers.  First, this spec provides more value to an email address -- so email providers have an incentive to provide it.  It's fairly trivial for them to do at least the basics; publish a static file off their main (or www) site, and provide a basic mapping service to point at whatever they have or know already that's public.  So the cost is low, and the potential benefits are high -- and once one email provider does this, it provides more incentive for the others to follow.

Second, some of the metadata is already present; every Yahoo! and Google user already has an OpenID service but none of them know it yet.  So there is value in just hooking up what's automatically provided.  However, this does lead to the danger of lock-in -- it's fine to default to your own service, but you shouldn't be limited to that service and you should also be able to override the defaults, ideally without needing to go and configure boring settings pages.  Profile pages are a valuable source of discovery data here if profile providers allow linking to services elsewhere.

Going Meta

There is another way to bootstrap.  Once you have a personal web discovery metadata service, and a way to edit per-user data, you can also create a personal web update service.  So then if you're at Flickr, and Flickr knows your email address, Flickr can find out, via discovery, if it can update your personal web data; and if so, offer to add itself as a photo stream service.  This would be done via OAuth of course, with your permission.  So services themselves could take care of the grungy work of adding links to your personal web.

Next Steps

Next steps are to get this documented properly, in the form of a HOWTO and running example code and some solid client libraries.  These are worth a million words of spec.

NB: You'll notice in general that there's no brilliant new idea here; this is just putting pieces that already exist together.  In fact, much of this is a re-invention of Liberty WSF discovery, but less SOAP-y and more deployable.


Positive Feedback in Social Search

One suggestion from today's social search session at #swfoo was to send queries off to both search engines and your friends (e.g., "vacations in Venice").  A problem here is that many of your friends are incompetent about vacations in Venice, so sending them this both spams them and decreases results relevancy -- noise increases linearly with overall size of system.  This is why the good results that early adopters with 20K followers have with "what's the best pizza in Sebastopol" aren't scalable.

But there's a nice solution to this, I think.  As you do get results that are somewhat relevant from friends, you click through on their answers.  Your clicks tell the system that a friend's answer was relevant in context, allowing it to learn which friends are competent in various fields.  Combine these results across everyone who is asking questions of the same friends to cancel out bias; you're left with a vector of weights for each person in the network, one weight per field of expertise.  Use this to do a few things:
  • Explicit reputation for people who answer, to accompany the implicit social debt incurred
  • Rank their answers higher in search results -- in many cases beating out traditional search engines, if the search engines prove to be less competent
  • Don't spam incompetent people with questions they can't answer
  • Potentially, reach beyond your immediate social network to find the real experts on the subjects and send your question to them.
This is much more scalable than trying to categorize your friends explicitly as experts in various areas.  You'll still do this implicitly, by first clicking on results from friends you already know to be expert, helping to bootstrap the system.  But you never need to know you're doing this; the system learns automatically.
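A minimal sketch of the feedback loop in Python, to show how little machinery it needs.  The class name, the smoothed click-through formula, and the 0.5 threshold are all my own assumptions for illustration; a real system would need a much better notion of "field" and of bias correction.

```python
from collections import defaultdict


class ExpertiseModel:
    """Learns per-person, per-topic weights from click-throughs on answers."""

    def __init__(self):
        self.shown = defaultdict(int)    # (person, topic) -> answers shown
        self.clicked = defaultdict(int)  # (person, topic) -> answers clicked

    def record(self, person, topic, clicked):
        """Log one answer from `person` shown to any asker, and whether it was clicked."""
        key = (person, topic)
        self.shown[key] += 1
        if clicked:
            self.clicked[key] += 1

    def weight(self, person, topic):
        """Smoothed click-through rate; someone we know nothing about gets 0.5."""
        key = (person, topic)
        return (self.clicked.get(key, 0) + 1) / (self.shown.get(key, 0) + 2)

    def who_to_ask(self, people, topic, threshold=0.5):
        """People worth asking about `topic`, most competent first."""
        candidates = [p for p in people if self.weight(p, topic) >= threshold]
        return sorted(candidates, key=lambda p: -self.weight(p, topic))


model = ExpertiseModel()
for _ in range(3):
    model.record("alice", "venice", clicked=True)   # Alice's answers get clicked
    model.record("bob", "venice", clicked=False)    # Bob's get ignored

print(model.weight("alice", "venice"))  # 0.8
print(model.weight("bob", "venice"))    # 0.2
print(model.who_to_ask(["alice", "bob", "carol"], "venice"))
# ['alice', 'carol']
```

Because clicks from every asker land in the same counters, the weights are pooled across the network; Bob stops getting Venice questions, while Carol, who hasn't been asked yet, still gets a chance.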

Social Web Foo: Standards for Public Social Web

Small but useful #swfoo session.  My idea was to try to give public social data formats, protocols, and standards some quality time, since (a) privacy and ACLs introduce many difficult problems that eat up lots of discussion time; and (b) there are many key use cases that are totally public, and might be easily solvable if we remove the distraction of privacy controls. @niall, @dewitt, and @steveganz attended, but per Foo rules, I won't attribute specific quotes.

Examples of this include public blogs, update streams, and feeds; and public following/friending relationships.  Typically following (one way) seems to be more likely to be public than friending, for social reasons.

Some random notes:  
  • Public content, once published, should be assumed to be "in the wild" everywhere, indefinitely, until the heat death of the universe.
  • PubSubHubbub (prior session) is a great example of a proposed open standard for improving the performance of public social data.
  • Problem:  How does an author prove authorship of data that's "in the wild" or syndicated?  Conversely, how do readers determine authenticity of an authorship claim?
  • Blogger's import/export facility currently "wrings the identity" out of the data, because we don't have any way to detect tampering with the supposed author/post/comment data between export and import.
  • There was a suggestion that signing a subset of fields in an Atom entry with Google's private key (so that anyone can verify it with the public key) could provide authorship attestation for that data (content, title, author, etc.), in UTF-8 only, which would then let us solve the import/export and syndication attribution problems without having to deal with DigSig.
  • Great example of a situation where a hosting web site needed attestation from a chain of 3 parties before allowing possibly copyright-infringing content to be uploaded; no standard exists for doing this online.
  • Would like to be able to link to a real world identity (vouched for) or to at least a profile provided by someone like Google; there are lots of pieces of data that would let Google vouch for identity of a profile owner, but no standard way to express this publicly.
  • Google for example could also do more general reputation which could also be public.
  • A public social graph consisting of following relationships is both useful, and potentially honestly mine-able, assuming users opted in with full knowledge that data was public and mine-able; this is very different from private relationships.
  • Public social graph is also potentially a way to determine public reputation; it's possible to game this, but difficult especially if the relationships are publicly visible on the open web so that subverting them believably would take months or years of stealth work.
  • Being able to verify past employment, educational credentials, etc. (data that a user chooses to make public and verifiable) would be very useful.

Deep Thought at Social Web Foo

Not mine; these guys:


Happy Birthday, RFC!

40 years ago today, the RFC (Request for Comments) was born -- RFC 1, "Host Software", was written April 7, 1969. Steve Crocker, the author, described its genesis in an op-ed piece for the New York Times. The humble RFC system is the basis for the entire infrastructure of the Web; it's amazing how far rough consensus and running code will get you.