The development of web archiving

Adrian Brown

doi:10.29085/9781856049009.003

Introduction

The history of web archiving is almost as long as that of the web itself. There are grounds for optimism in noting that, in a world where modern technological innovations, such as the first e-mail, have all too often been lost forever, the very first website is still preserved: that initial page of text and hyperlinks created in 1991 (see Chapter 1, Introduction) can still be viewed and navigated today. The first notable web archiving initiative was also, and remains to this day, the most ambitious. The Internet Archive was established in 1996, with the mission statement of ‘universal access to all human knowledge’. Since then, web archiving has rapidly evolved to become an international, multidisciplinary concern, spawning a multitude of research-based and practice-based programmes. This chapter describes some of the major milestones along that road, from the very first web archiving initiatives to the latest international research.

Initiation: the Internet Archive

The origins of the Internet Archive lie in Alexa Internet, a web cataloguing company founded by Brewster Kahle and Bruce Gilliat in 1996. At the same time, Kahle established the Internet Archive as a non-profit organization, with the aim of building a digital library to offer permanent access to historical collections which exist in digital form. Alexa harvests a snapshot of the world wide web every two months, each snapshot encompassing over 35 million websites. These snapshots are donated to the Internet Archive, and form the basis of its main collection.

Located in San Francisco, the Internet Archive is undoubtedly the largest web archive in the world. The raw statistics make impressive reading: as of 2005, the archive contained over 40 billion web pages; the total collection amounted to over one petabyte (1000 terabytes) of data, and was increasing at a rate of 20 terabytes per month.

The Internet Archive collects material by remote harvesting, the method used by the vast majority of web archiving programmes at present. This uses specialized web crawling software to copy web resources remotely, and is described in more detail in Chapter 4, Collection Methods. To date, these harvests have been collected using Alexa Internet's own proprietary web crawling software, which is not available for direct use by the Internet Archive or other organizations.
