On digitized books



For years, Project Gutenberg and related endeavours have been seeking to produce digital copies of books that are no longer under copyright. The Gutenberg people have already digitized 17,000. The reasons for doing so include making machine-readable copies available to people with disabilities, allowing the books to be used with e-book readers, and enabling more creative applications – like printing books onto scarves, so that you can read them on flights from the UK to the United States.

In the grand tradition of huge companies incorporating the results of smaller enterprises, many (if not all) of the Gutenberg books are now available through Google Book Search. Working out which Jane Austen book contains a passage stuck in your memory has thus become far simpler. For years, I have been using The Complete Works of William Shakespeare, provided by MIT, to search through plays.
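For the geekily inclined, here is a rough sketch of how that kind of passage search could work on plain-text copies of public-domain books. It is illustrative Python only, not a description of how Google Book Search or the MIT site works; the function names (`normalize`, `find_passage`) and the two short sample excerpts are my own assumptions, standing in for full texts you would load from files.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and keep only letters and digits, joined by single
    spaces, so that line breaks and punctuation do not block a match."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def find_passage(passage: str, books: dict[str, str]) -> list[str]:
    """Return the titles of all books whose text contains the passage."""
    needle = normalize(passage)
    if not needle:
        return []
    # Pad with spaces so that only whole-word sequences match.
    padded = f" {needle} "
    return [
        title
        for title, text in books.items()
        if padded in f" {normalize(text)} "
    ]


# Hypothetical mini-library: short excerpts in place of full book texts.
books = {
    "Pride and Prejudice": (
        "It is a truth universally acknowledged, that a single man in\n"
        "possession of a good fortune, must be in want of a wife."
    ),
    "Emma": (
        "Emma Woodhouse, handsome, clever, and rich, with a comfortable home\n"
        "and happy disposition"
    ),
}

print(find_passage("in possession of a good fortune", books))
```

Normalizing both the half-remembered passage and the book text is the design choice that matters here: plain-text editions wrap lines at arbitrary points, so an exact string match would miss a phrase that happens to straddle a line break.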

Admittedly, not many people want to sit in front of a monitor to read an entire book. Once electronic paper offers high resolution, high contrast, and no power draw while displaying static information, digitized books may become a whole lot more useful.


{ 6 comments… read them below or add one }

Milan September 1, 2006 at 10:51 am

Books Online (via Bookworm)

“The preeminent Internet publisher of literature, reference and verse providing students, researchers and the intellectually curious with unlimited access to books and information on the web, free of charge.” Includes the full text of the Harvard Classics.

“Classic Christian books in electronic format, selected for your edification,” including works by the Church Fathers.

“The Internet’s oldest producer of FREE electronic books (eBooks or eTexts).”

“From the ancient classics to the masterpieces of the 20th century, the Great Books are all the introduction you’ll ever need to the ideas, stories and discoveries that have shaped modern civilization.”

“Listing over 20,000 free books on the Web”

Sylvia September 2, 2006 at 5:05 pm

That scarf idea is brilliant, even if only symbolic. I can’t imagine a whole book would fit on a scarf. Maybe a sari…

Anonymous September 2, 2006 at 7:00 pm


Some poetry might be just the thing for a scarf. Especially if it was done in nice calligraphy.

. September 1, 2010 at 6:03 pm

“There has been a rage of attention to the recently revised proposal for a settlement by Google of a lawsuit brought against it by the Authors Guild of America and the Association of American Publishers (AAP). In 2004, Google launched the sort of project that only Internet idealists such as the entrepreneur and archivist Brewster Kahle had imagined: to scan eighteen million books, and make those books accessible on the Internet. How accessible depended upon the type of book. If the book was in the public domain, then Google would give you full access, and even permit you to download a digital copy of the book for free. If the book was presumptively under copyright, then at a minimum Google would grant “snippet access” to the work, meaning you could see a few lines around the words you searched, and then would be given information about where you could buy or borrow the book. But if the work was still in print, then publishers could authorize Google to make available as much of the book (beyond the snippets) as the publishers wanted.

The Authors Guild and AAP claimed that this plan violated copyright law. Their argument was simple and obvious–at least in the autistic sort of way that copyright law thinks about digital technology: when Google scanned the eighteen million books to build its index, it made a “copy” of them. For works still under copyright, the plaintiffs argued, this meant that Google needed permission from the copyright owner before that scan could occur. Never mind that Google scanned the works simply to index them; and never mind that it would never–without permission–distribute whole or even usable copies of the copyrighted works (except to the original libraries as replacements for lost physical copies). According to the plaintiffs, permission was vital, legally. Without it, Google was a pirate.

For 16 percent of the eighteen million books, the plaintiffs’ charges were no problem: these were works in the public domain. The law assured Google the free right to copy them. Likewise for the 9 percent that were still in print: for these too, it was relatively easy to identify who to ask before scanning was to happen. Publishers were delighted to assure this simple and cheap marketing for published works (practically all had signed up for the service before Google announced Google Book Search). But for 75 percent of the eighteen million books in our libraries, the rule of the plaintiffs would have been a digital death sentence. For these works–presumptively under copyright but no longer in print–to require permission first is to guarantee invisibility. These works are, practically speaking, orphans. It is effectively impossible–at least at the wholesale level–to secure permission for any use that triggers copyright law.

Google maintained–rightly, in my view–that its “use” of these copyrighted works (copying them so as to index them, and then simply enabling a search on that index) was “fair use.” That meant it needed no one’s permission before it scanned them, so long as its use was sufficiently transformative. But had Google lost the argument–and courts have been known to reach the wrong conclusion in copyright cases–then the company faced crippling liability.

So when it was given a chance to settle, it is no surprise Google took it (though Google insiders insist that fear of liability was not a motive). To its great credit, Google did not back off its claim that its use would have been a “fair use.” And even better, it secured from the plaintiffs and for the public a better deal than what “fair use” would have given it and the public. Under the settlement, Google would pay for the right to make up to 20 percent of copyrighted books whose author could not be found available to the public for free; and beyond 20 percent, the public could pay to access the full book, with the funds given over to a new non-profit charged with getting these royalties to the authors who want them. We get one-fifth of all the orphans (or one-fifth of each orphan) for free. And Google got the chance to build an eighteen-million-book digital library.

There is much to praise in this settlement. Lawsuits are expensive and uncertain. They take years to resolve. The deal Google struck guaranteed the public more free access to free content than “fair use” would have done. Twenty percent is better than snippets, and a system that channels money to authors is going to be liked much more than a system that does not. (Not to mention that the deal is elegant and clever in ways that a contracts professor can only envy.)”

. September 1, 2010 at 6:19 pm

“But whether authors are happy or not, it is critical to recognize that the free access that this world created was an essential part of how we passed our culture along. When you send your children to a library to write a research paper, you do not want them to have access to just 20 percent of each book they need to read. You want them to be able to read all of the book. And you do not want them to read just the books they think they would be willing to pay to access. You want them to browse: to explore, to wonder, to ask questions–the way, for example, people explore and wonder and ask questions using Google or Wikipedia. We had a culture where an enormous chunk of cultural life was proliferated and shared without most of us ever calling a copyright lawyer. Whether authors (or more likely, publishers) liked it or not, that was our fortunate past.

We are about to change that past, radically. And the premise for that change is an accidental feature of the architecture of copyright law: that it regulates copies. In the physical world, this architecture means that the law regulates a small set of the possible uses of a copyrighted work. In the digital world, this architecture means that the law regulates everything. For every single use of creative work in digital space makes a copy. Thus–the lawyer insists–every single use must in some sense be licensed. Even the scanning of a book for the purpose of generating an index–the action at the core of the Google book case–triggers the law of copyright, because that scanning, again, produces a copy.

And what this means, or so I fear, is that we are about to transform books into documentary films. The legal structure that we now contemplate for the accessing of books is even more complex than the legal structure that we have in place for the accessing of films. Or more simply still: we are about to make every access to our culture a legally regulated event, rich in its demand for lawyers and licenses, certain to burden even relatively popular work. Or again: we are about to make a catastrophic cultural mistake.”

. April 25, 2017 at 4:45 pm

Every weekday, semi trucks full of books would pull up at designated Google scanning centers. The one ingesting Stanford’s library was on Google’s Mountain View campus, in a converted office building. The books were unloaded from the trucks onto the kind of carts you find in libraries and wheeled up to human operators sitting at one of a few dozen brightly lit scanning stations, arranged in rows about six to eight feet apart.

The stations—which didn’t so much scan as photograph books—had been custom-built by Google from the sheet metal up. Each one could digitize books at a rate of 1,000 pages per hour. The book would lie in a specially designed motorized cradle that would adjust to the spine, locking it in place. Above, there was an array of lights and at least $1,000 worth of optics, including four cameras, two pointed at each half of the book, and a range-finding LIDAR that overlaid a three-dimensional laser grid on the book’s surface to capture the curvature of the paper. The human operator would turn pages by hand—no machine could be as quick and gentle—and fire the cameras by pressing a foot pedal, as though playing at a strange piano.

What made the system so efficient is that it left so much of the work to software. Rather than make sure that each page was aligned perfectly, and flattened, before taking a photo, which was a major source of delays in traditional book-scanning systems, cruder images of curved pages were fed to de-warping algorithms, which used the LIDAR data along with some clever mathematics to artificially bend the text back into straight lines.
