Intelliseek will be releasing a big corpus of spidered and annotated blog posts to attendees at the 3rd Annual Workshop on the Weblogging Ecosystem (held in conjunction with the WWW 2006 Conference in Edinburgh, Scotland):
The data release comprises a complete set of weblog posts for three weeks in July 2005 (on the order of 10M posts from 1M weblogs). This data set has been selected as it spans a period of time during which an event of global significance occurred, namely the London bombings.
The data set includes the full content of the posts plus mark-up. The marked-up fields include: date of posting, time of posting, author name, title of the post, weblog url, permalink, tags/categories, and outlinks classified by type - details may be found here.
Sounds like a great resource for researchers. I'm also amused (in a dark sort of way) by the datashare individual agreement they require people to sign. Essentially they admit that there's no way they can get copyright clearance from all million or so bloggers they've collected, so they just ask everyone to agree to remove any posts if anyone complains, not to use the results for commercial purposes, and not to use the data past the workshop.
Posted by bug to Media Technology at December 30, 2005 11:58 PM
This is nifty. We thought about getting organized and formally distributing our 70 gig dataset from LiveJournal, but never quite got around to it, and the datashare agreement I think demonstrates why that might not be a bad thing.
Their terminology is also interesting. "A complete set"..."on the order of 10M posts from 1M weblogs"? Well, 1.3M people updated their LiveJournal blog in the last month, so call that a million blogs updated in three weeks. LJ is in the top three hosting sites, last I checked, so multiply that out. Hmmm... if you subtract private (and thus nonexistent for searching) posts, I could see arriving at 1M blogs total updated. I'm still surprised it's not bigger, basically.
Posted by: jopoission at December 31, 2005 12:55 AM
I think they must mean "complete" as in all the posts from the blogs they indexed rather than all the known blogs out there. Technorati estimates there are about 70,000 new blogs a day and claims to track 24.4 million sites, and Intelliseek's own BlogPulse claims over 20 million.
Posted by: Bug at January 3, 2006 2:12 PM