Media Technology

What about a Google cache on my desk?

Yesterday I said that within a decade disk space should be cheap enough to put the entire visible web on your desk for under $1000. I think that’s actually a pretty conservative estimate, since it assumes a 100 KB average page size, up to an order of magnitude higher than some estimates.

Here’s another back-of-the-envelope estimate: let’s say we wanted the equivalent of Google’s web cache on your desktop (that is, all the HTML but no images). This calculation starts with the 2003 update to Berkeley’s How Much Info? study, which estimated that in 2002 the web was only 167 terabytes total, with only 30 TB as HTML (69 TB when you include images). Assuming 75% compression, the HTML alone comes to just around 8 TB. Also in 2002, an OCLC study calculated that the total number of web pages was only increasing by about 5% per year (with the number of sites actually shrinking, but the number of pages per site growing). That rate had been decreasing ever since the explosion in the mid ’90s, but let’s assume growth became a steady 5% and will stay at that rate for the next few years. (There are a lot of assumptions going on here, but the nice thing about these kinds of curves is that even if my numbers are off by a factor of two somewhere, so long as disk capacity per dollar keeps doubling every year, the crossover point only changes by one year.)

Now we’ve got two trends, and just need to find the intersection point for the price we want:

Year   Price of 1 TB disk   Size of public web*   Cost to store
2002   -                    8 TB                  -
2003   -                    8.5 TB                -
2004   -                    8.8 TB                -
2005   $500                 9.25 TB               $4,625
2006   $250                 9.7 TB                $2,425
2007   $125                 10.2 TB               $1,275
2008   $62.50               10.7 TB               $670
2009   $31.25               11.25 TB              $350
2010   $15.50               11.8 TB               $185

* Compressed HTML only, assumes 5% growth/year.

So given a few assumptions, we’ll be able to cache all the raw text on the public web for under $1000 (disk cost) within 3 years!
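
For anyone who wants to play with the assumptions, here is a minimal sketch of the same two-curve calculation in Python. The starting values (8 TB in 2002, 5% growth, $500 per TB in 2005, price halving each year) come from the post; the function names and the `size_scale` knob are just mine for illustration, and the figures may differ from the table by a little rounding.

```python
# Assumptions from the post: 8 TB of compressed HTML in 2002, growing 5% a
# year; a 1 TB disk costs $500 in 2005 and its price halves every year.
BASE_YEAR, BASE_SIZE_TB, GROWTH = 2002, 8.0, 0.05
PRICE_YEAR, PRICE_PER_TB = 2005, 500.0
BUDGET = 1000.0


def web_size_tb(year, size_scale=1.0):
    """Estimated size of the compressed-HTML web in TB for a given year."""
    return size_scale * BASE_SIZE_TB * (1 + GROWTH) ** (year - BASE_YEAR)


def price_per_tb(year):
    """Estimated price in dollars of 1 TB of disk for a given year."""
    return PRICE_PER_TB / 2 ** (year - PRICE_YEAR)


def cost_to_store(year, size_scale=1.0):
    """Disk cost in dollars to hold that year's web."""
    return web_size_tb(year, size_scale) * price_per_tb(year)


def first_year_under(budget, size_scale=1.0, start=PRICE_YEAR):
    """First year (from `start`) in which the storage cost drops below budget."""
    year = start
    while cost_to_store(year, size_scale) >= budget:
        year += 1
    return year


if __name__ == "__main__":
    for year in range(2005, 2011):
        print(year, f"{web_size_tb(year):.2f} TB", f"${cost_to_store(year):,.0f}")
    print("Crossover year:", first_year_under(BUDGET))
    print("Crossover if the web is twice as big:", first_year_under(BUDGET, size_scale=2.0))
```

With these inputs the crossover lands in 2008, and doubling the size estimate pushes it back only to 2009, which is the "off by a factor of two costs one year" effect mentioned above.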

When do I get the web in my pocket?

Some time ago I asked how much longer before I can have the Web in my pocket. Let’s try a quick back-of-the-envelope calculation:

A paper from January 2005 calculates the publicly indexable Web (the part easily accessible to search engine web-crawlers) as being around 11.5 billion pages. Estimates on average webpage size seem to be all over the map, but let’s figure around 100 KB per page, for a total of around a petabyte (one million Gig) for today’s indexed web. (I’m assuming text and images, but ignoring other media.)

Disk these days is going for less than 50 cents per Gig, so enough disk to store your own personal Google (and then some) costs around $500,000. With compression you can probably cut that in half. The price of disk is also falling by a factor of two every 12 months, so assuming no major jumps or snags in the disk-price curve, in a little less than a decade we can expect to hold the equivalent of today’s indexed web for less than $1000.
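
The arithmetic above can be sketched in a few lines of Python. The inputs (11.5 billion pages, 100 KB per page, 50 cents per Gig, price halving yearly, a $1000 target) are the ones from this post; like the text, the sketch rounds the total to one petabyte before pricing it, and the helper name `years_until_under` is just for illustration.

```python
PAGES = 11.5e9        # publicly indexable pages (January 2005 estimate)
PAGE_KB = 100         # assumed average page size in KB
PRICE_PER_GB = 0.50   # dollars per GB of disk
BUDGET = 1000.0       # target price in dollars

# Decimal units: 1 GB = 1e6 KB. This comes to 1.15 million GB,
# which the post rounds to "around a petabyte".
raw_size_gb = PAGES * PAGE_KB / 1e6
rounded_size_gb = 1_000_000

cost_today = rounded_size_gb * PRICE_PER_GB   # uncompressed
cost_compressed = cost_today / 2              # assumes 2:1 compression


def years_until_under(cost, budget):
    """Whole years of annual price halving before cost falls below budget."""
    years = 0
    while cost >= budget:
        cost /= 2
        years += 1
    return years


if __name__ == "__main__":
    print(f"Raw size: {raw_size_gb:,.0f} GB")
    print(f"Cost today: ${cost_today:,.0f} (${cost_compressed:,.0f} compressed)")
    print("Years until under budget:", years_until_under(cost_today, BUDGET))
    print("Same, with compression:", years_until_under(cost_compressed, BUDGET))
```

Nine halvings take $500,000 down to about $977, hence "a little less than a decade"; with compression it takes eight.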

Now of course, in that time the web will continue to grow, so we may no longer be satisfied with our measly petabyte-on-the-desk, but I figure the amount of human-generated Web content has a much slower growth rate than our disk-space curve. The number of web sites actually shrank between 2001 and 2002, and though it now seems to be growing again there’s only so much content that human beings can create in a day. The real question I have is whether in a decade anyone will see having access to the whole web as being all that interesting — I could easily see the majority of people losing interest in the surface web in favor of personal deep-web niches. The only reason I want the whole web in my pocket is because it’s too hard for me to filter out in advance the 99.99% of the web that’ll never be of interest to me — the closer we get to that kind of pruning, the less disk we need and the higher-quality the experience will be.

Update 8/2/05: doing a different back-of-the-envelope estimate leads to being able to store a compressed-HTML cache (no images) on less than $1000 worth of disk within 3 years…

Microsoft giving grants for Personal Lifetime Storage projects

Microsoft Research has announced a Request for Proposals for projects relating to their Digital Memories (Memex) research kit, in the context of “personal lifetime storage.” Microsoft’s inspiration (and probably the inspiration for everyone else working in this area too, at least indirectly) is Vannevar Bush’s 1945 article As We May Think, in which he famously described a kind of personal library-in-a-desk he called the memex:

Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, “memex” will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.

MSR expects to give 6-9 awards to college and university projects, up to a max of $50,000 per award, and recipients will also be given a SenseCam wearable camera and software from the MyLifeBits, VIBE and Phlat research projects at Microsoft Research. Strings are minimal: they expect semiannual progress reports, want the work presented at one or more of their workshops, and expect the project to be either dedicated to the public domain or released under an open license such as the BSD license.

New digital cinema standard to use JPEG 2000 compression

Yesterday a consortium of the major movie studios announced final specs for a new standard digital format for movie theaters. The specification uses JPEG 2000 video compression, which (though it happened before I started working there) I’m proud to say largely came out of work performed at my lab.

The big advantage of JPEG 2000 is that you can “pull out” bits from a code stream to get different resolutions: in this case a 4K distribution (1,302,083 bytes per frame at 24 FPS) and a 2K distribution (651,041 bytes per frame at 48 FPS) can both be generated on the fly from the same file, just by discarding segments of the stream. (Both figures work out to the same data rate, roughly 250 megabits per second.)

(Thanks to Mike for the link.)

Fujitsu shows bendable e-paper display

I’m a little late on this news, but last week Fujitsu announced a new bendable e-paper technology. EE Times has the most complete technical description I’ve seen on it:

The display is a passive-matrix, reflective type cholesteric liquid crystal display. Two 3.8-inch diagonal QVGA prototypes, a monochrome display and a color version able to display 512 colors, were shown.

Differing from widely used flat displays that have color filters consisting of red, green and blue pixels, the paper display has a three layered structure in total about 0.8 mm thick. One layer consists of two 0.125 mm-thick films sandwiching liquid crystal. Cholesteric crystals in each layer are twisted in a certain pitch to reflect only red, green or blue light respectively.

Images on the screen can be changed with 10-milliwatts to 100-milliwatts depending on scanning speed.

(Thanks to John for the link…)

Google Maps & Virtual Earth

Microsoft and Google both came out with new versions of free online satellite mapping software this past weekend. Google Maps has added the “hybrid view” that lets you see your driving directions laid out on the satellite image itself, which is the feature I’ve wanted ever since they came out with satellite view. Microsoft has just released their web-based Virtual Earth, which doesn’t yet support driving directions (coming soon I’m sure) but does include a nice (dare I say “Google Maps-like”?) scrollable interface and switching between maps and aerial photography. They also have an interface for keeping track of multiple locations on a scratch pad, an API for adding your own way-points on the URL line, and a cute zoom-in animation.

One fun feature of Virtual Earth is that some parts of the US have incredible resolution: compare Seattle’s Space Needle from Virtual Earth and Google Maps to see what I mean. Unfortunately, Virtual Earth’s image coverage is pretty spotty. In spite of the name, it only covers the USA; I’m guessing they’re just using publicly available USGS images right now. Also, for many areas they’re using very old black-and-white images that they’ve then overlaid with color for roads and parks. This leads to a few embarrassing misses, like the fact that their map shows Apple Computer’s corporate HQ as yet to be built (I didn’t see any horse-and-buggies on the streets though, so it can’t be too old).

Jargon watch

According to a Data Memo by the Pew Internet & American Life Project, 29% of online Americans have a good idea what phishing means, 13% what podcasting is, and only 9% know what RSS feeds are. Over half knew the terms adware, internet cookies, spyware, firewall and spam.

Of course, the real question in my mind isn’t whether people know what phishing means, but whether they know that regardless of what it’s called the 22 “You must change your PayPal password!” emails they have in their inbox are attempts at fraud. Still, it’ll be interesting to see how these terms spread in the next six months or so.

(Thanks to Rowan for the link.)

Japanese graphical search engine

MarsFlag is a new search engine in Japan (it went live in March) that presents returned results as thumbnail images instead of text, with larger pop-up versions when you roll over them with the mouse. It supports full-text search (e.g. this search on wearable computer) as well as pictorial topic areas like movies, fashion magazines and motorcycles.

According to Internet Watch [JP → EN], the search engine at least in part determines results ranking using bookmarks kept by the 35,000 subscribers to the Mark Agent web-based bookmarking service that the company also owns. MarsFlag claims this helps thwart attempts to gain page rank by creating link farms, a process called search engine optimization. (Presumably that’ll only work until SEO companies start generating fake Mark Agent accounts…)

Panasonic shows off color eBook prototype

Tech-On reports that Matsushita (Panasonic) showed off a new prototype color eBook reader with a 5.6″, 210 points-per-inch display at the NE Technology Summit 2005 event held in Tokyo yesterday. Given that their current grayscale Σ book uses a bistable display made by Kent Displays, I would hazard a guess that their prototype is using Kent’s new color ChLCD display (but that’s just a guess). Bistable displays like the ChLCD and eInk’s microcapsule display (used in the Sony LIBRIé) take power to change an image but not to maintain it, so they’re incredibly low power for low-framerate applications like eBooks.

(via engadget by way of Steve)
