Afterlife for web pages

One research tool that surprisingly few people seem to know about is the Wayback Machine, at the Internet Archive. If you are looking for the old corporate homepage of the disbanded mercenary firm Executive Outcomes, or want to see something that used to be posted on a governmental site, but is no longer available there, it is worth a try.

Obviously, they cannot archive everything that is online, but the collection is complete enough to have helped out more than a couple of my friends. People who operate sites may also be interested in having a look at what data from their own sites has been collected.
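
For anyone who would rather check from a script than through the website, here is a minimal sketch in Python. It assumes the Internet Archive's public availability endpoint at archive.org/wayback/available and the JSON shape I understand it to return (an "archived_snapshots" object with a "closest" entry). The function names are my own, so treat the whole thing as an illustration to verify against the Archive's documentation.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for the Wayback Machine availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(page_url, timestamp=None):
    """Build the lookup URL for a page, optionally near a YYYYMMDD-style timestamp."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the archived copy's URL from a JSON response, or None if there is none."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```

To use it, fetch build_query("example.com", "2006") with urllib.request.urlopen, decode the body, and pass the text to closest_snapshot. A result of None means no archived copy was reported for that address.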

Author: Milan

In the spring of 2005, I graduated from the University of British Columbia with a degree in International Relations and a general focus in the area of environmental politics. In the fall of 2005, I began reading for an M.Phil in IR at Wadham College, Oxford. Outside school, I am very interested in photography, writing, and the outdoors. I am writing this blog to keep in touch with friends and family around the world, provide a more personal view of graduate student life in Oxford, and pass on some lessons I've learned here.

8 thoughts on “Afterlife for web pages”

  1. Eight hours after writing this post, I saw that the OUP blog has a post that involves the Wayback Machine.

    It is really just a link to this video.

    It is all a bit over-done, but still informative.

  2. Sorry, but that is a lot overdone. The first third is tolerable, the remainder indefensible.

  3. Blackwater drops tarnished name

    The Associated Press

    February 13, 2009 at 1:28 PM EST

    RALEIGH, N.C. — Blackwater Worldwide is abandoning its tarnished brand name as it tries to shake a reputation battered by oft-criticized work in Iraq, renaming its family of two dozen businesses under the name Xe.

    The parent company’s new name is given the U.S. pronunciation of the letter “z.” Blackwater Lodge & Training Centre — the subsidiary that conducts much of the company’s overseas operations and domestic training — has been renamed U.S. Training Centre Inc., the company said Friday.

    The decision comes as part of an ongoing rebranding effort that grew more urgent following a September, 2007, shooting in Iraq that left at least a dozen civilians dead. Blackwater president Gary Jackson said in a memo to employees the new name reflects the change in company focus away from the business of providing private security.

  4. Archiving the web
    Born digital
    National libraries start to preserve the web, but cannot save everything

    Oct 21st 2010

    IN THE digital realm, things seem always to happen the wrong way round. Whereas Google has hurried to scan books into its digital catalogue, a group of national libraries has begun saving what the online giant leaves behind. For although search engines such as Google index the web, they do not archive it. Many websites just disappear when their owner runs out of money or interest. Adam Farquhar, in charge of digital projects for the British Library, points out that the world has in some ways a better record of the beginning of the 20th century than of the beginning of the 21st.

    In 1996 Brewster Kahle, a computer scientist and internet entrepreneur, founded the Internet Archive, a non-profit organisation dedicated to preserving websites. He also began gently harassing national libraries to worry about preserving the web. They started to pay attention when several elections produced interesting material that never touched paper.

    In 2003 eleven national libraries and the Internet Archive launched a project to preserve “born-digital” information: the kind that has never existed as anything but digitally. Called the International Internet Preservation Consortium (IIPC), it now includes 39 large institutional libraries. But the task is impossible. One reason is the sheer amount of data on the web. The groups have already collected several petabytes of data (a petabyte can hold roughly 10 trillion copies of this article).

    Another issue is ensuring that the data is stored in a format that makes it available in centuries to come. Ancient manuscripts are still readable. But much digital media from the past is readable only on a handful of fragile and antique machines, if at all. The IIPC has set a single format, making it more likely that future historians will be able to find a machine to read the data. But a single solution cannot capture all content. Web publishers increasingly serve up content-rich pages based on complex data sets. Audio and video programmes based on proprietary formats such as Windows Media Player are another challenge. What happens if Microsoft is bankrupt and forgotten in 2210?

  5. “The Internet Archive is augmenting its existing mirrors — one in San Francisco, one in Amsterdam, one at the Library of Alexandria (that is: San Andreas fault, below sea level, military dictatorship) — with a copy in Canada, on the premise that “lots of copies keep stuff safe.”

    Canada is hardly a paragon of freedom. The new guy looks great with his shirt off (so does Putin), but he also rescued the old Prime Minister’s “Patriot Act fanfic” surveillance bill and broke his promise to fix it after the election.

    But Internet Archive founder Brewster Kahle is right: lots of copies are better. If Canada tries to censor the Internet Archive, it probably won’t go after the same stuff as Trump, nor at the same time. More copies are better.”

  6. Robots (or spiders, or crawlers) are little computer programs that search engines use to scan and index websites. Robots.txt is a little file placed on webservers to tell search engines what they should and shouldn’t index. The Internet Archive isn’t a search engine, but has historically obeyed exclusion requests from robots.txt files. But it’s changing its mind, because robots.txt is almost always crafted with search engines in mind and rarely reflects the intentions of domain owners when it comes to archiving.

    An excellent decision. To be clear, they’re ignoring robots.txt even if you explicitly identify and disallow the Internet Archive. It’s a splendid reminder that nothing published on the web is ever meaningfully private, and will always go on your permanent record.
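
The robots.txt mechanism described in the last comment can be sketched with Python's standard urllib.robotparser module. The sample file and the "ia_archiver" user-agent name are assumptions for illustration (that name has historically been associated with the Internet Archive's crawler, but check before relying on it). The sketch shows what a crawler that chooses to obey the file would conclude. Nothing in the file can force compliance, which is the point the comment makes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named crawler entirely,
# and keep every other crawler out of /private/ only.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Report whether a crawler that honours robots.txt would fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this sample, is_allowed(SAMPLE_ROBOTS_TXT, "ia_archiver", "https://example.com/index.html") is False, while the same page is allowed for any other crawler name. A crawler that ignores robots.txt simply never runs a check like this.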
