Every year Dan Russell at IBM Almaden hosts a small one-day workshop for HCI and related researchers to schmooze and talk about a particular subject. This year's topic was near & dear to my heart: what do we do with WAAAY too much information? The first half of the day featured talks from what Dan jokingly described as "people making the problem worse"; the second half dealt with specific methods for trying to understand huge amounts of information.
One common thread through all the projects described is their huge scale. The Internet Archive has the modest goal of making all of mankind's published work available to everyone in the world, Microsoft's MyLifeBits project is trying to store every piece of data a person ever touches in his entire life, Andreas Weigend is trying to make sense out of the millions of transactions Amazon.com makes on a daily basis, etc. Such ambition is made possible by the usual culprits: geometric increases in storage capacity and processing power, and the continued digitization and networking of large segments of society.
Dan set the stage with a brief description of Millipede, IBM's still-experimental nanotech storage system that potentially offers 114 terabits of data per square inch, at a transfer rate of 800 Gbits/second. The primary question for the rest of the day: what do you do with a terabyte on your laptop (or, I might add, your cellphone)?
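Just for fun, here's the back-of-the-envelope arithmetic those figures imply. This is a rough sketch using only the density and transfer-rate numbers quoted above, ignoring any real-world overhead like formatting or error correction:

```python
# Back-of-envelope: how much area and time does one terabyte take
# at the Millipede figures quoted above? Rough numbers only.

DENSITY_TBIT_PER_SQIN = 114      # quoted density, terabits per square inch
TRANSFER_GBIT_PER_SEC = 800      # quoted transfer rate, gigabits per second

terabyte_bits = 8 * 1e12         # 1 terabyte = 8 terabits = 8e12 bits

area_sq_in = (terabyte_bits / 1e12) / DENSITY_TBIT_PER_SQIN
seconds = terabyte_bits / (TRANSFER_GBIT_PER_SEC * 1e9)

print(f"1 TB occupies about {area_sq_in:.3f} square inches")   # ~0.07 in^2
print(f"1 TB transfers in about {seconds:.0f} seconds")        # ~10 s
```

At those rates a laptop terabyte is less a storage problem than a sense-making problem, which is exactly where the day was headed.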
Kahle's goal is simple: All published works of mankind, available to all members of the world. He raises four questions:
Yes.
The numbers are both staggering and staggeringly within reach: he estimates about 28 Terabytes for all the texts in the Library of Congress (in ASCII format). That's about a meter of storage space and $60,000 or less in Linux boxes. Scanning is slightly trickier: it currently costs around $10-15 to digitize a book, with the job outsourced to India. They're about to start scanning a million books, right in the library, in Canada.
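To make the "staggeringly within reach" point concrete, here's the arithmetic behind those figures. The constants below are just the numbers quoted above; everything else is my own rough calculation:

```python
# Rough cost arithmetic using the figures quoted in Kahle's talk.

STORAGE_TB = 28                 # all LoC texts as ASCII, per his estimate
STORAGE_COST_USD = 60_000       # "$60,000 or less in Linux boxes"
SCAN_COST_PER_BOOK = (10, 15)   # outsourced scanning, dollars per book
BOOKS_TO_SCAN = 1_000_000       # the million-book scanning project

print(f"Storage: about ${STORAGE_COST_USD / STORAGE_TB:,.0f} per terabyte")

low, high = (c * BOOKS_TO_SCAN for c in SCAN_COST_PER_BOOK)
print(f"Scanning a million books: ${low:,.0f} - ${high:,.0f}")
```

Storage works out to a couple thousand dollars per terabyte; the scanning labor, not the disks, is where the real money goes.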
The IA was also behind the Internet Archive Bookmobile, a van with a satellite dish on the roof and printing equipment inside. With local volunteer labor, they're looking at a printing cost of around $1 per book.
Other media: Of the 2-3 million published pieces of audio out there, they've archived around 2000 albums that have been made available under Creative Commons licenses and have around 15,000 live recordings of concerts from tape-trader-friendly bands like The Grateful Dead. Of the few hundred thousand movies that have been published, about 600 have fallen into the public domain, of which they currently have 300 available. Their current prize movie download is Night of the Living Dead, though I prefer Duck and Cover myself. They've apparently been storing about 20 Terabytes per month of television for years now, though they've only made one week available: TV from around the world from slightly before the planes hit on 9/11/2001 till 9/18/2001. They're trying to extract the bits from software, but the thorny bit is that cracking copy protection on old 1980s-era PC games is a violation of the DMCA. They've received a 3-year exemption, but are 1.5 years through and need help with the job. Finally, they archive the Web, taking a 40-50 Terabyte snapshot every 6 months.
Preservation: The lesson from Library of Alexandria v.1.0: make multiple copies. The new Bibliotheca Alexandrina, built near the site of the legendary library, includes a mirror of the Internet Archive, running on 200 machines and holding 100 Terabytes total. The dangers to the archive, Kahle says, are less hardware issues like drives and boards and more things like programmer error, curator error, and governmental "error" (such as burning or shutting down the library). His goal is to install mirrors around the world in order to protect the data. You can also download the open-source software and hardware plans to make your own petabyte server, or it can be custom built and installed for about $2 Million.
Here's the biggest sticking point, especially with stuff published between 1964 and the Internet age. About half the books in the Library of Congress are in the public domain. Of the half that are still in copyright, most are out of print and often it's difficult to even identify who owns the copyright at this point in time. Kahle has filed a lawsuit (Kahle v. Ashcroft) asking for clarification on whether a library has the legal right to distribute in-copyright-but-out-of-print books to patrons.
The question of who owns all this data came up several times during the day, and remains one of the big unanswered questions.
We're sure gonna try.
MyLifeBits is a living experiment that aims to digitize and store every piece of information that crosses a person's path and then hopefully learn what applications are especially useful and what issues might arise with such a system. It's what I call YAMIP (Yet Another Memex-Inspired Project) and is in the same vein as RXRC's Forget-Me-Not, Georgia Tech's eClass Project, the MIT Media Lab's Wearable Computing Project and Ricoh CRC's IM3 project. In the past few years they've stored about 44 Gig of data from Gordon Bell's life, including scans of books he's written, recordings of phone calls he's made, etc. It all goes into an SQL Server database, tagged by whatever metadata and annotations they can access.
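I don't know the actual MyLifeBits schema, but conceptually the storage model is simple: a table of captured items plus a table of whatever metadata and annotations can be attached to them. A minimal sketch of that idea, with sqlite3 standing in for SQL Server and all table, column, and sample values invented for illustration:

```python
# A sketch of "everything into a database, tagged with whatever
# metadata we can get." sqlite3 stands in for SQL Server; the schema
# is invented for illustration, not MyLifeBits' actual design.
import sqlite3

db = sqlite3.connect("lifebits.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS items (
    id        INTEGER PRIMARY KEY,
    kind      TEXT,      -- 'email', 'phone call', 'scanned page', ...
    captured  TEXT,      -- ISO timestamp
    blob_path TEXT       -- where the raw bits live
);
CREATE TABLE IF NOT EXISTS annotations (
    item_id   INTEGER REFERENCES items(id),
    key       TEXT,      -- 'caller', 'title', 'location', ...
    value     TEXT
);
""")

# Store a phone call with whatever metadata the capture tool could grab.
cur = db.execute(
    "INSERT INTO items (kind, captured, blob_path) VALUES (?, ?, ?)",
    ("phone call", "2003-11-04T14:22:00", "/store/calls/0001.wav"))
db.executemany(
    "INSERT INTO annotations (item_id, key, value) VALUES (?, ?, ?)",
    [(cur.lastrowid, "caller", "A. Caller"),
     (cur.lastrowid, "duration_sec", "312")])
db.commit()
```

The capture side is the easy part; the hard part, as always, is getting annotations that are richer than a timestamp and a file path.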
It takes a lot of dedication to pull off a living experiment like this, and I'm glad to see they're trying it. Unfortunately, I also feel like they're starting from scratch instead of building on the lessons learned from previous work and living experiments. Some of my skepticism comes from the fact that they aren't putting much effort into the user-interface issues, which to me are where the biggest unsolved problems lie. But mostly I'm not sure what, at the end of the day, we'll learn from MyLifeBits that we don't already know: that having lots of information available is useful if the cost is low enough, that making the cost low enough is a really hard interface and information-organization problem, and that The Annotation Problem™ is a tough nut to crack.
A couple interesting things they have determined, at least based on their limited researcher-as-user sample set:
My big questions about MyLifeBits and similar projects:
Weigend started his talk with a simple question:
The customer pushes "add to cart" on Amazon.com. It's like a game. The customer made a move; now it's Amazon's move. What should they show?
The rest of his talk detailed how they experimentally answered this and similar questions by mining the logs from their million+ orders per day. Some key points from the talk:
GeoFusion showed an impressive demo of their GIS-visualization engine (currently being licensed by ESRI for their new ArcGlobe product). They started their talk with a CG image of the Earth from space, and as they zoomed in, terrain details came into view in almost (but not quite) seamless transitions. They finally stopped at what looked like a satellite photo of the IBM Almaden building where we were meeting. That part was somewhat impressive but not surprising — what surprised everyone was when they then flicked the view angle to show that what we were actually looking at was a texture-mapped 3D model of, well, the entire Earth, albeit with only a few high-resolution areas. Very impressive.
Descriptions don't really do the experience justice, nor do the static images they have on their Web site, but if you have the right kind of hardware check out their Mars Demo.
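The "only a few high-resolution areas" bit hints at the standard multi-resolution trick: treat the globe as a pyramid of tiles and only refine the tiles near where the viewer is looking. I don't know how GeoFusion actually does it, but a toy version of that tile-selection logic looks something like this (a crude square-tile model of the globe, invented for illustration):

```python
# Toy level-of-detail tile selection for a tiled globe: refine a tile
# only while the camera is close enough to justify more resolution.
# A guess at the general technique, not GeoFusion's code.
from dataclasses import dataclass

@dataclass
class Tile:
    lat: float      # center latitude of the tile, degrees
    lon: float      # center longitude, degrees
    size: float     # angular width of the tile, degrees
    level: int      # 0 = whole-Earth tile

MAX_LEVEL = 8

def visible_tiles(tile, cam_lat, cam_lon):
    """Return the tiles to draw for a camera over (cam_lat, cam_lon)."""
    dist = max(abs(tile.lat - cam_lat), abs(tile.lon - cam_lon))
    close = dist < 2 * tile.size           # crude "is the camera nearby?" test
    if tile.level == MAX_LEVEL or not close:
        return [tile]                      # coarse tile is good enough
    half = tile.size / 2
    children = [Tile(tile.lat + dy, tile.lon + dx, half, tile.level + 1)
                for dy in (-half / 2, half / 2)
                for dx in (-half / 2, half / 2)]
    return [t for child in children for t in visible_tiles(child, cam_lat, cam_lon)]

# Camera zoomed in over the San Jose area: only nearby tiles get refined.
root = Tile(lat=0.0, lon=0.0, size=360.0, level=0)
tiles = visible_tiles(root, cam_lat=37.3, cam_lon=-121.8)
print(len(tiles), "tiles, finest level", max(t.level for t in tiles))
```

The payoff is that the renderer only ever touches a few hundred tiles at a time, no matter how big the underlying imagery is, which is presumably why the zoom felt nearly seamless.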
WebFountain is IBM's new platform for doing content analysis on large numbers of unstructured documents, including Web pages, netnews, weblogs, bulletin boards, newspapers, magazines, press releases, etc. They're set up to answer a range of questions, from detection & aggregation (Did my worldwide brand launch gain acceptance in the media?) to relationships (Who are IBM's primary partners and competitors?) to patterns & trends (Are college newspapers starting to talk more about tuition hikes than in previous years?).
Behind WebFountain are a large number of data sources including the Web, BBSs, Blogs, PDFs, subscriber/paid data feeds, unstructured text from public databases and customer-supplied content. These are run through a blackboard-like system that adds metadata tags such as "this is an occurrence of the name of company <Foo>", and everything is indexed by a massively-parallel cluster of Linux boxes at the rate of about 500-600 documents per second.
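The "blackboard-like system" sounds like a chain of annotators, each reading the shared document record and writing its own metadata tags onto it before the document hits the indexer. The details here are my speculation (the annotators and tag names below are invented), but the general shape is something like:

```python
# Sketch of a blackboard-style annotation pipeline: each annotator
# reads the shared document record and adds its own metadata tags.
import re

def company_annotator(doc):
    # Pretend entity spotter: tag occurrences of known company names.
    for name in ("IBM", "Amazon"):
        if name in doc["text"]:
            doc["tags"].append({"type": "company", "value": name})

def language_annotator(doc):
    # Trivial stand-in for a language detector.
    doc["tags"].append({"type": "language", "value": "en"})

def date_annotator(doc):
    for match in re.finditer(r"\b(19|20)\d\d\b", doc["text"]):
        doc["tags"].append({"type": "year", "value": match.group()})

PIPELINE = [company_annotator, language_annotator, date_annotator]

def process(raw_text):
    doc = {"text": raw_text, "tags": []}   # the shared "blackboard"
    for annotator in PIPELINE:
        annotator(doc)
    return doc                             # ready to hand to the indexer

print(process("IBM announced a partnership with Amazon in 2003."))
```

In the real system the annotators presumably run spread across the Linux cluster; the point is just that each one layers its tags onto the same shared record.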
A couple interesting points that came up during Q&A:
My notes are sparse for this talk, but a very brief description is that McGuinness is developing interactive Web-based tools for maintaining ontologies over time, checking ontologies for (logical) correctness, completeness, and style, merging ontological terms from varied sources, and validating input. The main software system she described is Chimaera, built on top of the Ontolingua Distributed Collaborative Ontology Environment.