[note: I originally scribbled this on paper thinking I could hand it off immediately and skip the typing, posting, etc. Turns out I don't get off that lightly, so here's the spew in electrons.]
Scraping isn't a scalable model. There are biz issues around aggregating data: many businesses don't want you to get their data, though many are becoming more open.
Doing aggregation right:
* minimize latency
* maximize engagement
High latency causes confusion and takes you out of real-time.
Conditional GETs (If-None-Match / If-Modified-Since) can be somewhat useful for cutting the cost of polling.
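For anyone who hasn't used them, here's a minimal sketch of a conditional GET using only Python's standard library. The function names are mine, not anything shown in the session. The idea is to send back the validators (ETag, Last-Modified) from the previous fetch and treat a 304 as "nothing new".

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET that lets the server answer 304 if nothing changed."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None when unchanged."""
    request = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the cached copy and validators
            return None, etag, last_modified
        raise
```

This saves bandwidth and parsing on both ends, but you still pay for a round trip per poll, which is probably why it only rates "somewhat useful".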
Plaxo had to shard their crawlers, which lands you in the shared state/sync problem of any stateful system you want to scale horizontally.
Gnip integration has been good:
* Offload the long-running processes
* Gnip offers alerting or "fat ping" (ping includes update data)
Plaxo likes using the alert to escalate the priority of the crawler which fetches the rich data related to the update. This keeps content ingestion on one consistent path, as opposed to taking the info from the fat ping and augmenting it later.
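A rough sketch of the "alert bumps the crawler's priority" idea, as I understood it. This is not Plaxo's code: the class, the two priority levels, and the source names are all made up for illustration. The alert carries no content; it only moves the source to the front of the crawl queue, and the normal crawl does the ingestion.

```python
import heapq
import itertools

URGENT, ROUTINE = 0, 1  # lower number is crawled sooner (hypothetical levels)


class CrawlQueue:
    """Priority queue of sources to crawl; alerts escalate a source."""

    def __init__(self):
        self._heap = []
        self._priority = {}                # source -> best queued priority
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def schedule(self, source, priority=ROUTINE):
        current = self._priority.get(source)
        if current is not None and current <= priority:
            return  # already queued at this priority or better
        self._priority[source] = priority
        heapq.heappush(self._heap, (priority, next(self._counter), source))

    def on_alert(self, source):
        """An alert says 'something changed'; crawl this source next."""
        self.schedule(source, URGENT)

    def pop(self):
        """Return the next source to crawl, or None if the queue is empty."""
        while self._heap:
            priority, _, source = heapq.heappop(self._heap)
            if self._priority.get(source) == priority:
                del self._priority[source]
                return source
            # otherwise this is a stale entry left behind by an escalation
        return None
```

With sources "a", "b", "c" scheduled routinely and an alert arriving for "c", the pops come out as "c", "a", "b".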
Smarr: "Brad Fitzpatrick said, 'Make polling a special case of push.'" He attributed this to someone but I missed the attribution.
(Don't try to keep up with Joseph Smarr on paper. He thinks too many cogent thoughts too quickly to preserve legibility.)
Plaxo uses TripIt's RSS feed as alerting, grabs item ID, then uses their APIs to fetch rich data.
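Roughly what "feed as alert" could look like. The sample feed, the guid values, and the function name are invented for illustration; I don't know the actual shape of TripIt's feed or API. The feed is only used to learn which item IDs are new, and the rich data would then come from a separate authenticated API call per ID.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample; a real feed would be fetched over HTTP.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example trips</title>
  <item><title>Trip to Austin</title><guid>trip-101</guid></item>
  <item><title>Trip to Boston</title><guid>trip-102</guid></item>
</channel></rss>"""


def new_item_ids(feed_xml, seen_ids):
    """Return guids from the feed that have not been processed yet."""
    root = ET.fromstring(feed_xml)
    ids = [item.findtext("guid") for item in root.iter("item")]
    return [item_id for item_id in ids if item_id and item_id not in seen_ids]


# Each returned ID would then be handed to the API client for the rich data.
```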
There's a move to homogenize the info from sites, which may not be a good idea. It suppresses the distinctive look and feel/experience of the publishing site. Allowing for these differences means more labor spent on making one-off shims, which increases maintenance. Still, allowing for them is the right choice in order to provide value to the user.
Activity streams seek to provide more rich data in a somewhat normalized, extensible format.
Many/most sites aren't yet perfectly architected for real-time's push, ping, etc.
PubSubHubBub and Activity Streams are externally represented data shards.
Plaxo's Pulse started with known architecture issues (in order to ship) and hit the wall sooner than expected. Threw hardware/software optimizations at the problem to move the wall far enough to give time for rearchitecture, sharding, and working out how to propagate changes throughout the system properly.
None of the NoSQL alternatives are quite ready for prime-time. Smarr: "It should be something that's just a primitive."
Conversation platforms are slightly different sorts of aggregation platforms. There are UI diffs (e.g. pause the stream when indicating interest). Handling the transition from slightly-latent/passive real-time to synchronous/active real-time is not yet well-developed (think: when a comment inspires a conversation).
90-99% of the value of the real-time web is realized in not-real-time [unreal-time? ;] This is a big deal for discovery. Twitter and FB make this harder by obscuring history.
Ideal scalability/performance would be an index per user. This would be grossly inefficient due to the number of duplicate entries.
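To make the duplication concrete, here's a toy sketch of an index per user (often called fan-out on write). The data and names are mine. Every post is copied into the index of every reader who follows the author, so reads are trivial but storage grows with posts times followers.

```python
from collections import defaultdict


def fan_out(posts, followers):
    """Write each post id into the index of every follower of its author."""
    indexes = defaultdict(list)
    for post_id, author in posts:
        for reader in followers.get(author, ()):
            indexes[reader].append(post_id)
    return indexes


posts = [("p1", "joe"), ("p2", "joe")]
followers = {"joe": ["ann", "bob", "cy"]}
indexes = fan_out(posts, followers)
total_entries = sum(len(ids) for ids in indexes.values())
```

Here two posts become six index entries; scale the follower count up and you can see where "grossly inefficient" comes from.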
No one has nailed reader-controlled aggregation (Show me Joe's tweets and blogs but not his photos) quite yet.
Smarr: "If we're all kinda [sharing], we're all making each other smarter"
The firehose of info is a hard model to scale to. Ben Metcalfe proposes the garden hose -- a firehose filtered at the source according to your interests, which helps aggregators by allowing them to request the superset of all filters from a given publisher.
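A small sketch of the superset idea, assuming interests can be expressed as sets of topic tags (my simplification, not something specified in the session). The aggregator sends the publisher one combined filter, then routes what comes back to the individual readers.

```python
def superset_filter(subscriptions):
    """Union of every reader's interests: the single filter sent upstream."""
    combined = set()
    for topics in subscriptions.values():
        combined |= topics
    return combined


def route(item_topics, subscriptions):
    """Readers whose interests overlap the topics of an incoming item."""
    return sorted(reader for reader, topics in subscriptions.items()
                  if topics & item_topics)


# Hypothetical readers and tags.
subscriptions = {"joe": {"conference", "bars"}, "ann": {"photos"}}
upstream = superset_filter(subscriptions)
```

The publisher only has to evaluate one filter per aggregator rather than one per end user, and the aggregator does the per-reader fan-out on its side.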
We really want to push contexts to the publishers and let them determine which content fits that context. Context shifts over time: Joe doesn't normally read my tweets (and why would he?) but when we're at a conference together, he's much more interested (thus the popularity of hashtags). This is a geographic and purpose-driven context (the conference) as well as Joe's context on me (Jim knows where the good bars are).
Folks like Twitter are so overloaded with info that they might not recognize non-immediate contexts that are interesting to me.
There's also the risk of exposing users to the amount of correlatable public data they have. Many don't want you to apply a transitive closure to identify them in all spaces even though doing so allows you to present a much more convenient UX around what they want you to aggregate.
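To show how little it takes, here's a sketch of that transitive closure using a plain union-find. The account names and links are invented. Given pairwise "these two accounts belong to the same person" links, it groups every account that is reachable from any other.

```python
def identity_clusters(links):
    """Group accounts connected by any chain of pairwise links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for account in list(parent):
        clusters.setdefault(find(account), set()).add(account)
    return sorted(sorted(cluster) for cluster in clusters.values())


links = [
    ("twitter:jim", "blog:jim"),
    ("blog:jim", "flickr:jim"),
    ("twitter:ann", "blog:ann"),
]
clusters = identity_clusters(links)
```

Nobody ever linked twitter:jim to flickr:jim directly, yet they end up in the same cluster, which is exactly the correlation many users don't expect.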
Someone likened the real-time aggregation problem to a bar conversation: you get snippets here and there and follow your own thread of interestingness.
Three fundamental themes:
* How to specify contexts to data provider/publisher
* How to control access to private data (and carry ACLs with that data)
* How to do all this efficiently
Plaxo implemented polling back-off (poll infrequently updated sources less frequently). Turns out this is a bad idea: when a quiet source finally does update, the long wait before the next poll makes the product feel broken.
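A sketch of one common form of back-off (doubling with a cap), just to show where the latency comes from. The 60-second base and one-hour cap are numbers I picked; the session didn't give Plaxo's actual schedule.

```python
def next_poll_interval(current, had_update, base=60, cap=3600):
    """Seconds until the next poll: reset on an update, else double up to cap."""
    if had_update:
        return base
    return min(current * 2, cap)


# A source that stays quiet for six polls in a row.
interval = 60
for _ in range(6):
    interval = next_poll_interval(interval, had_update=False)
```

After six quiet polls the interval has gone from 60 seconds to the 3600-second cap, so the first update after a quiet spell can sit unseen for up to an hour.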
There's also the issue of aggregating conversation about web objects (like blog posts) and how not to divert the conversation from the publisher's site. However, sometimes you want a private discussion of a public object (cf. LinkedIn company groups discussing an article).
Q: What's the state of open standards around this?
A: PubSubHubBub and Activity Streams are very exciting, and OAuth handles access delegation. There's still a lot of ground to cover.