Collection methods

Adrian Brown

doi:10.29085/9781856049009.005

Introduction

This chapter describes various possible methods of collecting websites for archival purposes. The variety of approaches is dictated by the nature of web technology itself. This chapter therefore begins with a summary of website technology, before describing the various collection methods in detail. The strengths and limitations of each method are also considered. The design of a website can be an important factor in determining the ease with which it can be collected, and the range of methods appropriate. This chapter therefore also considers how webmasters can create ‘archive-friendly’ websites.

The technology of the web

The experience of using the world wide web arises from the interplay between two fundamental components – the web server and the web client, such as a web browser. A web server stores content, such as HTML pages and images, which it delivers, or ‘serves’, to a web browser in response to requests from that browser. A web browser requests content from web servers, and renders the content it receives for the user. The interaction between these two components is therefore as significant as the components themselves. A communications protocol provides the mechanism by which this interaction takes place: it defines a standard format for communications between the server and the browser. The most commonly used protocol on the web is the hypertext transfer protocol (HTTP). Thus, when a browser sends a request to a server, that request takes the form of an HTTP ‘message’, as does the reply from the server.
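The exchange of HTTP messages described above can be illustrated with a minimal sketch in Python. It is not part of the original chapter: the function names build_get_request and parse_response, and the host www.example.org, are hypothetical and chosen purely for illustration. The sketch composes the text of a simple request and splits a reply into its status code, headers and body, without contacting any real server.

```python
def build_get_request(host, path):
    """Compose the bytes of a minimal HTTP/1.1 GET request message."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, protocol version
        f"Host: {host}",         # header naming the server being addressed
        "Connection: close",     # ask the server to close the connection afterwards
        "",                      # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw):
    """Split a raw HTTP response into (status code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    # The status line looks like "HTTP/1.1 200 OK"; the code is the second field.
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


# Illustrative use with a canned reply (no network traffic involved).
request = build_get_request("www.example.org", "/index.html")
reply = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
status, headers, body = parse_response(reply)
```

Both the request and the reply share the same overall shape: a first line, a set of headers, a blank line and an optional body. A real browser or collection tool sends many more headers and handles far more cases than this sketch does.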

All the content available on a web server is identified using a uniform resource locator (URL) – a reference that describes where on the web that content is located (see Chapter 3, Selection, for a more detailed discussion of URLs). The nature of URLs is one of the defining characteristics of the web, and creates a very indirect relationship between browsers and servers: neither the browser nor the server needs to know anything about the other, beyond the information contained within the HTTP message. Thus, a browser requests content by sending an HTTP request containing the relevant URL.
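As a brief illustration of how a URL carries everything a client needs in order to locate content, the following sketch uses Python's standard urllib.parse module to split a URL into its parts. The helper name describe_url and the example address are assumptions made for illustration only, and do not come from the chapter.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Return the main components of a URL as a dictionary."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,     # protocol to use, e.g. "http"
        "host": parts.hostname,     # the server that holds the content
        "port": parts.port,         # None when the URL does not state a port
        "path": parts.path or "/",  # location of the content on that server
        "query": parts.query,       # optional parameters after "?"
    }


# Hypothetical address, used only to show the decomposition.
info = describe_url("http://www.example.org/reports/2005/index.html?lang=en")
```

In this example the host identifies which server to contact, while the path and query identify the content requested from it. The client needs no other prior knowledge of the server.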
