2 - The development of web archiving

Published online by Cambridge University Press: 08 June 2018

Summary

Introduction

The history of web archiving is almost as long as that of the web itself. There are grounds for optimism in noting that, in a world where the earliest artefacts of modern technological innovations, such as the first e-mail, have all too often been lost forever, the very first website is still preserved: that initial page of text and hyperlinks created in 1991 (see Chapter 1, Introduction) can still be viewed and navigated today. The first notable web archiving initiative was also the most ambitious, and remains so to this day. The Internet Archive was established in 1996, with the mission statement of ‘universal access to all human knowledge’. Since then, web archiving has rapidly evolved to become an international, multidisciplinary concern, spawning a multitude of research-based and practical programmes. This chapter describes some of the major milestones along that road, from the very first web archiving initiatives to the latest international research.

Initiation: the Internet Archive

The origins of the Internet Archive lie in Alexa Internet, a web cataloguing company founded by Brewster Kahle and Bruce Gilliat in 1996. At the same time, Kahle established the Internet Archive as a non-profit organization, with the aim of building a digital library to offer permanent access to historical collections which exist in digital form. Alexa harvests a snapshot of the world wide web every two months, each snapshot encompassing over 35 million websites. These snapshots are donated to the Internet Archive, and form the basis of its main collection.

Located in San Francisco, the Internet Archive is undoubtedly the largest web archive in the world. The raw statistics make impressive reading: as of 2005, the archive contained over 40 billion web pages; the total collection amounted to over one petabyte (1000 terabytes) of data, and was increasing at a rate of 20 terabytes per month.

The Internet Archive collects material by remote harvesting, the method currently used by the vast majority of web archiving programmes. This method uses specialized web crawling software to copy web resources remotely, and is described in more detail in Chapter 4, Collection Methods. To date, these harvests have been collected using Alexa Internet's own proprietary web crawling software, which is not available for direct use by the Internet Archive or other organizations.
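
As a rough illustration of what remote harvesting involves, the following Python sketch copies pages breadth-first from a single host by following hyperlinks. It is a minimal sketch under stated assumptions, not a description of Alexa Internet's proprietary crawler or of any production tool: the names `harvest`, `LinkExtractor` and `max_pages` are hypothetical, and `fetch` is an assumed caller-supplied function that returns the HTML for a URL (or `None` if the page cannot be retrieved).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(seed_url, fetch, max_pages=100):
    """Breadth-first copy of the pages reachable on the seed URL's host.

    `fetch` is a hypothetical callable (URL -> HTML text, or None when the
    page is unavailable). Returns a dict mapping each harvested URL to the
    content that was copied for it.
    """
    host = urlparse(seed_url).netloc
    queue = deque([seed_url])
    seen = {seed_url}
    snapshot = {}
    while queue and len(snapshot) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: nothing to copy
        snapshot[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            target, _fragment = urldefrag(urljoin(url, href))
            # Stay on the seed's host and never queue a URL twice.
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return snapshot
```

In this sketch `fetch` is injected so that the traversal logic can be exercised against an in-memory set of pages; a real harvester would wrap an HTTP client at that point and would also need to handle matters such as crawl politeness, non-HTML resources and storage, which are omitted here.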

Type: Chapter
Information: Archiving Websites: a practical guide for information management professionals, pp. 8-23
Publisher: Facet
Print publication year: 2006
