Afterlife for web pages


in Geek stuff, Internet matters

One research tool that surprisingly few people seem to know about is the Wayback Machine, at the Internet Archive. If you are looking for the old corporate homepage of the disbanded mercenary firm Executive Outcomes, or want to see something that used to be posted on a governmental site, but is no longer available there, it is worth a try.

Obviously, they cannot archive everything that is online, but the collection is complete enough to have helped out more than a couple of my friends. People who operate sites may also be interested in having a look at what data of theirs the Archive has collected.
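For the programmatically inclined, the Internet Archive exposes a simple availability API for checking whether a page has been archived. A minimal sketch (the target URL here is just an example, and network access is assumed):

```python
import json
import urllib.request

# Ask the Wayback Machine for the closest archived snapshot of a page.
query = "http://archive.org/wayback/available?url=example.com"
with urllib.request.urlopen(query) as resp:
    data = json.load(resp)

# If a snapshot exists, the response includes its URL and timestamp.
snapshot = data.get("archived_snapshots", {}).get("closest")
if snapshot:
    print(snapshot["url"], snapshot["timestamp"])
else:
    print("No archived copy found")
```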


Milan March 7, 2007 at 11:00 pm

Eight hours after writing this post, I saw that the OUP blog has a post that involves the Wayback Machine.

It is really just a link to this video.

It is all a bit overdone, but still informative.

Anonymous March 7, 2007 at 11:08 pm

Sorry, but that is a lot overdone. The first third is tolerable, the remainder indefensible.

. February 14, 2009 at 11:52 pm

Blackwater drops tarnished name

The Associated Press

February 13, 2009 at 1:28 PM EST

RALEIGH, N.C. — Blackwater Worldwide is abandoning its tarnished brand name as it tries to shake a reputation battered by oft-criticized work in Iraq, renaming its family of two dozen businesses under the name Xe.

The parent company’s new name is given the U.S. pronunciation of the letter “z.” Blackwater Lodge & Training Centre — the subsidiary that conducts much of the company’s overseas operations and domestic training — has been renamed U.S. Training Centre Inc., the company said Friday.

The decision comes as part of an ongoing rebranding effort that grew more urgent following a September, 2007, shooting in Iraq that left at least a dozen civilians dead. Blackwater president Gary Jackson said in a memo to employees the new name reflects the change in company focus away from the business of providing private security.

. November 2, 2010 at 10:21 am

Archiving the web
Born digital
National libraries start to preserve the web, but cannot save everything

Oct 21st 2010

In the digital realm, things seem always to happen the wrong way round. Whereas Google has hurried to scan books into its digital catalogue, a group of national libraries has begun saving what the online giant leaves behind. For although search engines such as Google index the web, they do not archive it. Many websites just disappear when their owner runs out of money or interest. Adam Farquhar, in charge of digital projects for the British Library, points out that the world has in some ways a better record of the beginning of the 20th century than of the beginning of the 21st.

In 1996 Brewster Kahle, a computer scientist and internet entrepreneur, founded the Internet Archive, a non-profit organisation dedicated to preserving websites. He also began gently harassing national libraries to worry about preserving the web. They started to pay attention when several elections produced interesting material that never touched paper.

In 2003 eleven national libraries and the Internet Archive launched a project to preserve “born-digital” information: the kind that has never existed as anything but digitally. Called the International Internet Preservation Consortium (IIPC), it now includes 39 large institutional libraries. But the task is impossible. One reason is the sheer amount of data on the web. The groups have already collected several petabytes of data (a petabyte can hold roughly 10 trillion copies of this article).

Another issue is ensuring that the data is stored in a format that makes it available in centuries to come. Ancient manuscripts are still readable. But much digital media from the past is readable only on a handful of fragile and antique machines, if at all. The IIPC has set a single format, making it more likely that future historians will be able to find a machine to read the data. But a single solution cannot capture all content. Web publishers increasingly serve up content-rich pages based on complex data sets. Audio and video programmes based on proprietary formats such as Windows Media Player are another challenge. What happens if Microsoft is bankrupt and forgotten in 2210?

. November 29, 2016 at 4:09 pm

” The Internet Archive is augmenting its existing mirrors — one in San Francisco, one in Amsterdam, one at the Library of Alexandria (that is: San Andreas fault, below sea level, military dictatorship) — with a copy in Canada, on the premise that “lots of copies keep stuff safe.”

Canada is hardly a paragon of freedom. The new guy looks great with his shirt off (so does Putin), but he also rescued the old Prime Minister’s “Patriot Act fanfic” surveillance bill and broke his promise to fix it after the election.

But Internet Archive founder Brewster Kahle is right: lots of copies are better. If Canada tries to censor the Internet Archive, it probably won’t go after the same stuff as Trump, nor at the same time. More copies are better.”

. April 23, 2017 at 6:12 pm

Robots (or spiders, or crawlers) are little computer programs that search engines use to scan and index websites. Robots.txt is a little file placed on webservers to tell search engines what they should and shouldn’t index. The Internet Archive isn’t a search engine, but has historically obeyed exclusion requests from robots.txt files. But it’s changing its mind, because robots.txt is almost always crafted with search engines in mind and rarely reflects the intentions of domain owners when it comes to archiving.
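The exclusion mechanism being described can be demonstrated with Python's standard-library robots.txt parser. The file contents below are hypothetical, showing a site that singles out `ia_archiver` (the user agent historically honoured by the Internet Archive's crawler) while giving other crawlers a narrower rule:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the Internet Archive's crawler
# entirely, but only keep ordinary search engines out of /private/.
robots_txt = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The Archive's crawler is told to stay away from everything...
print(parser.can_fetch("ia_archiver", "https://example.com/page"))     # False
# ...while a search engine may fetch most pages, but not /private/.
print(parser.can_fetch("Googlebot", "https://example.com/page"))       # True
print(parser.can_fetch("Googlebot", "https://example.com/private/x"))  # False
```

Under the policy change described above, the Archive would simply no longer honour even the first, explicitly targeted rule.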

An excellent decision. To be clear, they’re ignoring robots.txt even if you explicitly identify and disallow the Internet Archive. It’s a splendid reminder that nothing published on the web is ever meaningfully private, and will always go on your permanent record.

