Every year Dan Russell at IBM Almaden hosts a small one-day workshop for HCI and related researchers to schmooze and talk about a particular subject. This year's topic was near & dear to my heart: what do we do with WAAAY too much information? The first half of the day featured talks from what Dan jokingly described as "people making the problem worse"; the second half dealt with specific methods for trying to understand huge amounts of information.
One common thread through all the projects described is their huge scale. The Internet Archive has the modest goal of making all of mankind's published work available to everyone in the world, Microsoft's MyLifeBits project is trying to store every piece of data a person ever touches in his entire life, Andreas Weigend is trying to make sense out of the millions of transactions Amazon.com makes on a daily basis, etc. Such ambition is made possible by the usual culprits: geometric increases in storage capacity and processing power, and the continued digitization and networking of large segments of society.
Dan set the stage with a brief description of Millipede, IBM's still-experimental nanotech storage system that potentially offers 114 terabits of data per square inch, at a transfer rate of 800 Gbits/second. The primary question for the rest of the day: what do you do with a terabyte on your laptop (or, I might add, your cellphone)?
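Just for fun, here's the back-of-the-envelope arithmetic those figures imply. This is a rough sketch using only the density and transfer-rate numbers quoted above, ignoring any real-world overhead like formatting or error correction:

```python
# Back-of-envelope: how much area and time does one terabyte take
# at the Millipede figures quoted above? Rough numbers only.

DENSITY_TBIT_PER_SQIN = 114      # quoted density, terabits per square inch
TRANSFER_GBIT_PER_SEC = 800      # quoted transfer rate, gigabits per second

terabyte_bits = 8 * 1e12         # 1 terabyte = 8 terabits = 8e12 bits

area_sq_in = (terabyte_bits / 1e12) / DENSITY_TBIT_PER_SQIN
seconds = terabyte_bits / (TRANSFER_GBIT_PER_SEC * 1e9)

print(f"1 TB occupies about {area_sq_in:.3f} square inches")   # ~0.07 in^2
print(f"1 TB transfers in about {seconds:.0f} seconds")        # ~10 s
```

At those rates a laptop terabyte is less a storage problem than a sense-making problem, which is exactly where the day was headed.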
Kahle's goal is simple: All published works of mankind, available to all members of the world. He raises four questions:
Yes.
The numbers are both staggering and staggeringly within reach: he estimates about 28 Terabytes for all the texts in the Library of Congress (in ASCII format). That's about a meter of storage space and $60,000 or less in Linux boxes. Scanning is slightly trickier: it currently costs around $10-15 to digitize a book, with the job outsourced to India. They're about to start scanning a million books, right in the library, in Canada.
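To make the "staggeringly within reach" point concrete, here's the arithmetic behind those figures. The constants below are just the numbers quoted above; everything else is my own rough calculation:

```python
# Rough cost arithmetic using the figures quoted in Kahle's talk.

STORAGE_TB = 28                 # all LoC texts as ASCII, per his estimate
STORAGE_COST_USD = 60_000       # "$60,000 or less in Linux boxes"
SCAN_COST_PER_BOOK = (10, 15)   # outsourced scanning, dollars per book
BOOKS_TO_SCAN = 1_000_000       # the million-book scanning project

print(f"Storage: about ${STORAGE_COST_USD / STORAGE_TB:,.0f} per terabyte")

low, high = (c * BOOKS_TO_SCAN for c in SCAN_COST_PER_BOOK)
print(f"Scanning a million books: ${low:,.0f} - ${high:,.0f}")
```

Storage works out to a couple thousand dollars per terabyte; the scanning labor, not the disks, is where the real money goes.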
The IA was also behind the Internet Archive Bookmobile, a van with a satellite dish on the roof and printing equipment inside. With local volunteer labor, they're looking at a printing cost of around $1 per book.
Other media: Of the 2-3 million published pieces of audio out there, they've archived around 2000 albums that have been made available under Creative Commons licenses and have around 15,000 live recordings of concerts from tape-trader-friendly bands like The Grateful Dead. Of the few hundred thousand movies that have been published, about 600 have fallen into the public domain, of which they currently have 300 available. Their current prize movie download is Night of the Living Dead, though I prefer Duck and Cover myself. They've apparently been storing about 20 Terabytes per month of television for years now, though they've only made one week available: TV from around the world from slightly before the planes hit on 9/11/2001 till 9/18/2001. They're trying to extract the bits from software, but the thorny bit is that cracking copy protection on old 1980s-era PC games is a violation of the DMCA. They've received a 3-year exemption, but are 1.5 years through and need help with the job. Finally, they archive the Web, taking a 40-50 Terabyte snapshot every 6 months.
Preservation: The lesson from Library of Alexandria v.1.0: make multiple copies. The new Bibliotheca Alexandrina, built near the site of the legendary library, includes a mirror of the Internet Archive, running on 200 machines and holding 100 Terabytes total. The dangers to the archive, Kahle says, are less hardware issues like drives and boards and more things like programmer error, curator error, and governmental "error" (such as burning or shutting down the library). His goal is to install mirrors around the world in order to protect the data. You can also download the open-source software and hardware plans to make your own petabyte server, or it can be custom built and installed for about $2 Million.
Here's the biggest sticking point, especially with stuff published between 1964 and the Internet age. About half the books in the Library of Congress are in the public domain. Of the half that are still in copyright, most are out of print and often it's difficult to even identify who owns the copyright at this point in time. Kahle has filed a lawsuit (Kahle v. Ashcroft) asking for clarification on whether a library has the legal right to distribute in-copyright-but-out-of-print books to patrons.
The question of who owns all this data came up several times during the day, and remains one of the big unanswered questions.
We're sure gonna try.
MyLifeBits is a living experiment that aims to digitize and store every piece of information that crosses a person's path and then hopefully learn what applications are especially useful and what issues might arise with such a system. It's what I call YAMIP (Yet Another Memex-Inspired Project) and is in the same vein as RXRC's Forget-Me-Not, Georgia Tech's eClass Project, the MIT Media Lab's Wearable Computing Project and Ricoh CRC's IM3 project. In the past few years they've stored about 44 Gig of data from Gordon Bell's life, including scans of books he's written, recordings of phone calls he's made, etc. It all goes into an SQL Server database, tagged by whatever metadata and annotations they can access.
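I don't know the actual MyLifeBits schema, but conceptually the storage model is simple: a table of captured items plus a table of whatever metadata and annotations can be attached to them. A minimal sketch of that idea, with sqlite3 standing in for SQL Server and all table, column, and sample values invented for illustration:

```python
# A sketch of "everything into a database, tagged with whatever
# metadata we can get." sqlite3 stands in for SQL Server; the schema
# is invented for illustration, not MyLifeBits' actual design.
import sqlite3

db = sqlite3.connect("lifebits.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS items (
    id        INTEGER PRIMARY KEY,
    kind      TEXT,      -- 'email', 'phone call', 'scanned page', ...
    captured  TEXT,      -- ISO timestamp
    blob_path TEXT       -- where the raw bits live
);
CREATE TABLE IF NOT EXISTS annotations (
    item_id   INTEGER REFERENCES items(id),
    key       TEXT,      -- 'caller', 'title', 'location', ...
    value     TEXT
);
""")

# Store a phone call with whatever metadata the capture tool could grab.
cur = db.execute(
    "INSERT INTO items (kind, captured, blob_path) VALUES (?, ?, ?)",
    ("phone call", "2003-11-04T14:22:00", "/store/calls/0001.wav"))
db.executemany(
    "INSERT INTO annotations (item_id, key, value) VALUES (?, ?, ?)",
    [(cur.lastrowid, "caller", "A. Caller"),
     (cur.lastrowid, "duration_sec", "312")])
db.commit()
```

The capture side is the easy part; the hard part, as always, is getting annotations that are richer than a timestamp and a file path.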
It takes a lot of dedication to pull off a living experiment like this, and I'm glad to see they're trying it. Unfortunately, I also feel like they're starting from scratch instead of building on the lessons learned from previous work and living experiments. Some of my skepticism comes from the fact that they aren't putting much effort into the user-interface issues, which to me are where the biggest unsolved problems lie. But mostly I'm not sure what, at the end of the day, we'll learn from MyLifeBits that we don't already know: that having lots of information available is useful if the cost is low enough, that making the cost low enough is a really hard interface and information-organization problem, and that The Annotation Problem™ is a tough nut to crack.
A couple interesting things they have determined, at least based on their limited researcher-as-user sample set:
My big questions about MyLifeBits and similar projects:
Weigend started his talk with a simple question:
The customer pushes "add to cart" on Amazon.com. It's like a game. The customer made a move; now it's Amazon's move. What should they show?
The rest of his talk detailed how they experimentally answered this and similar questions by mining the logs from their million+ orders per day. Some key points from the talk:
GeoFusion showed an impressive demo of their GIS-visualization engine (currently being licensed by ESRI for their new ArcGlobe product). They started their talk with a CG image of the Earth from space, and as they zoomed in, terrain details came into view in almost (but not quite) seamless transitions. They finally stopped at what looked like a satellite photo of the IBM Almaden building where we were meeting. That part was somewhat impressive but not surprising — what surprised everyone was when they then flicked the view angle to show that what we were actually looking at was a texture-mapped 3D model of, well, the entire Earth, albeit with only a few high-resolution areas. Very impressive.
Descriptions don't really do the experience justice, nor do the static images they have on their Web site, but if you have the right kind of hardware check out their Mars Demo.
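The "only a few high-resolution areas" bit hints at the standard multi-resolution trick: treat the globe as a pyramid of tiles and only refine the tiles near where the viewer is looking. I don't know how GeoFusion actually does it, but a toy version of that tile-selection logic looks something like this (a crude square-tile model of the globe, invented for illustration):

```python
# Toy level-of-detail tile selection for a tiled globe: refine a tile
# only while the camera is close enough to justify more resolution.
# A guess at the general technique, not GeoFusion's code.
from dataclasses import dataclass

@dataclass
class Tile:
    lat: float      # center latitude of the tile, degrees
    lon: float      # center longitude, degrees
    size: float     # angular width of the tile, degrees
    level: int      # 0 = whole-Earth tile

MAX_LEVEL = 8

def visible_tiles(tile, cam_lat, cam_lon):
    """Return the tiles to draw for a camera over (cam_lat, cam_lon)."""
    dist = max(abs(tile.lat - cam_lat), abs(tile.lon - cam_lon))
    close = dist < 2 * tile.size           # crude "is the camera nearby?" test
    if tile.level == MAX_LEVEL or not close:
        return [tile]                      # coarse tile is good enough
    half = tile.size / 2
    children = [Tile(tile.lat + dy, tile.lon + dx, half, tile.level + 1)
                for dy in (-half / 2, half / 2)
                for dx in (-half / 2, half / 2)]
    return [t for child in children for t in visible_tiles(child, cam_lat, cam_lon)]

# Camera zoomed in over the San Jose area: only nearby tiles get refined.
root = Tile(lat=0.0, lon=0.0, size=360.0, level=0)
tiles = visible_tiles(root, cam_lat=37.3, cam_lon=-121.8)
print(len(tiles), "tiles, finest level", max(t.level for t in tiles))
```

The payoff is that the renderer only ever touches a few hundred tiles at a time, no matter how big the underlying imagery is, which is presumably why the zoom felt nearly seamless.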
WebFountain is IBM's new platform for doing content analysis on large numbers of unstructured documents, including Web pages, netnews, weblogs, bulletin boards, newspapers, magazines, press releases, etc. They're set up to answer a range of questions, from detection & aggregation (Did my worldwide brand launch gain acceptance in the media?) to relationships (Who are IBM's primary partners and competitors?) to patterns & trends (Are college newspapers starting to talk more about tuition hikes than in previous years?).
Behind WebFountain are a large number of data sources including the Web, BBSs, Blogs, PDFs, subscriber/paid data feeds, unstructured text from public databases and customer-supplied content. These are run through a blackboard-like system that adds metadata tags such as "this is an occurrence of the name of company <Foo>", and everything is indexed by a massively-parallel cluster of Linux boxes at the rate of about 500-600 documents per second.
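The "blackboard-like system" sounds like a chain of annotators, each reading the shared document record and writing its own metadata tags onto it before the document hits the indexer. The details here are my speculation (the annotators and tag names below are invented), but the general shape is something like:

```python
# Sketch of a blackboard-style annotation pipeline: each annotator
# reads the shared document record and adds its own metadata tags.
import re

def company_annotator(doc):
    # Pretend entity spotter: tag occurrences of known company names.
    for name in ("IBM", "Amazon"):
        if name in doc["text"]:
            doc["tags"].append({"type": "company", "value": name})

def language_annotator(doc):
    # Trivial stand-in for a language detector.
    doc["tags"].append({"type": "language", "value": "en"})

def date_annotator(doc):
    for match in re.finditer(r"\b(19|20)\d\d\b", doc["text"]):
        doc["tags"].append({"type": "year", "value": match.group()})

PIPELINE = [company_annotator, language_annotator, date_annotator]

def process(raw_text):
    doc = {"text": raw_text, "tags": []}   # the shared "blackboard"
    for annotator in PIPELINE:
        annotator(doc)
    return doc                             # ready to hand to the indexer

print(process("IBM announced a partnership with Amazon in 2003."))
```

In the real system the annotators presumably run spread across the Linux cluster; the point is just that each one layers its tags onto the same shared record.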
A couple interesting points that came up during Q&A:
My notes are sparse for this talk, but a very brief description is that McGuinness is developing interactive Web-based tools for maintaining ontologies over time, checking ontologies for (logical) correctness, completeness, and style, merging ontological terms from varied sources, and validating input. The main software system she described is Chimaera, built on top of the Ontolingua Distributed Collaborative Ontology Environment.