I’ve been reading up on IBM’s recently announced WebFountain project. The system, which has been dubbed Google on steroids, spiders the Net and other databases and applies various data-mining, natural-language processing and pattern recognition techniques to the data. The current system uses 500 parallel-processing Linux boxes, all accessing about half a terabyte of storage in the basement of the IBM Almaden Research Center. IBM’s infrastructure allows clients to customize their searches and standing queries using a library that will “tokenize the data to identify people and companies, and discover patterns, trends and relationships in the data.” The technology is being offered as a service, and is already being sold through a partnership with Factiva. It is being marketed mainly for trend identification and for “reputation management,” where a company watches chat rooms, bulletin boards, newspapers and other sources to see what people are saying about it.
I’m quite interested in the technology, and even have a friend from grad school who has been working on it (Hi Dan!). But the thing that got me thinking was a comment about privacy by Robert Morris, the director of IBM Almaden. As reported in the San Jose Mercury News:
The technology could potentially raise privacy concerns if companies turned its power on analyzing individuals. But Hart and Morris said both companies would protect user privacy.
“Anything we mine is public data on the Web,” Morris said.
But it isn’t yet clear how the company would restrict users trying to use the tool to invade someone’s privacy.
The quote is in line with the comment by The Economist: “No doubt some people will say it sounds a little intrusive. But all WebFountain does is reveal information that is hidden in plain sight.”
Unfortunately, the idea that anything findable on the Net is “public” is a dodge: “public data” is a simplification of what is a much more complex set of social rules. Counter-intuitive as it may sound, privacy rules are not primarily about restricting information access to particular people. The primary purpose of privacy rules is to keep people from using the information in ways that would harm the person who is keeping it secret. This is why companies wink at employees sharing trade secrets with a wife or husband but are adamant about not revealing them to potential competitors, unless those competitors have first signed a non-disclosure agreement. The NDA explicitly restricts harmful uses of the data, making the privacy rules unnecessary.
The idea that privacy is a restriction on power was brought home to me a few years ago by an old fraternity brother of mine. Back when he was still finishing his PhD at MIT he got a call from an MIT campus policeman, who somewhat sheepishly explained that he was calling on behalf of an irate member of the Massachusetts Maritime Police Department. Apparently this maritime policeman had been surfing the Web and had come across a picture from my friend’s undergraduate fraternity days, showing him firing water balloons from a giant funnelator. The campus policeman said he was calling to inform my friend that slingshots are illegal in Massachusetts, and that he wanted to make sure that the device had been destroyed.
So here was a picture that was clearly “public” in that it had been published for anyone to see. The intended audience was anyone who was interested in our fraternity’s annual Water War, plus anyone else who might get a chuckle out of it. You could even say the intended audience was everyone in the world except for particularly humor-impaired members of the Massachusetts Maritime Police Department. If web servers had provided such vaguely defined access rules, we certainly would have used them.
A more realistic idea of public vs. private spaces is one of intended use, with restrictions on access as a proxy for limiting that use. When I write an article for an academic journal or even a blog entry I expect to be called upon to defend my position. When I write a LiveJournal post I expect much less criticism, and I expect that people who read my postings will be the sort of people who generally agree with me and will be accepting of whatever personal thoughts I write. Both are published on the Web, both are “public,” but different social rules are implied by the relative ease of access, ease of discovery, and the different communities that are most likely to come across my posts. Difficult access provides a kind of “soft wall” that restricts access to certain communities, and the social rules of those communities provide a soft wall that limits how my information will be used. I expect most LiveJournal users would feel violated if information from their posts wound up being used in targeted marketing literature, even though most posts aren’t password protected.
I don’t intend to slam WebFountain with this argument. WebFountain is just the latest technology that is moving soft walls around by changing the ground rules. It was also only a matter of time before such a service was offered. As a coworker of mine has pointed out, it is almost a certainty that the NSA has already developed such technology. (The argument goes: (a) The NSA would have to be really incompetent not to have done this, and (b) the NSA is not incompetent.) Given that this is likely, it seems better for society that such technology be out in the open so people can adjust their expectations about how soft those soft walls really are.