Book contents
- Frontmatter
- Contents
- Acknowledgements
- Glossary
- 1 Introduction
- 2 The development of web archiving
- 3 Selection
- 4 Collection methods
- 5 Quality assurance and cataloguing
- 6 Preservation
- 7 Delivery to users
- 8 Legal issues
- 9 Managing a web archiving programme
- 10 Future trends
- Appendix 1 Web archiving and preservation tools
- Appendix 2 Model permissions form
- Appendix 3 Model test script
- Appendix 4 Model issues log
- Appendix 5 Model job description
- Bibliography
- Index
- Digital Preservation
Summary
Introduction
This chapter describes the methods available for collecting websites for archival purposes. The variety of approaches is dictated by the nature of web technology itself, so the chapter begins with a summary of that technology before describing each collection method in detail, together with its strengths and limitations. The design of a website can be an important factor in determining both the ease with which it can be collected and the range of methods that are appropriate. The chapter therefore also considers how webmasters can create ‘archive-friendly’ websites.
The technology of the web
The experience of using the world wide web arises from the interplay between two fundamental components – the web server and the web client, such as a web browser. A web server stores content, such as HTML pages and images, which it delivers, or ‘serves’, to a web browser in response to requests from that browser. A web browser requests content from web servers, and renders the content it receives for the user. The interaction between these two components is therefore as significant as the components themselves. A communications protocol provides the mechanism by which this interaction takes place: it defines a standard format for communications between the server and the browser. The most commonly used protocol on the web is the hypertext transfer protocol (HTTP). Thus, when a browser sends a request to a server, that request takes the form of an HTTP ‘message’, as does the reply from the server.
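The request and reply ‘messages’ described above can be illustrated with a minimal sketch in Python. It is not taken from the book: the function names `build_request` and `parse_response`, the host `www.example.org` and the sample reply are all assumptions chosen for illustration, and the sketch handles only the simplest well-formed HTTP/1.1 messages without contacting any real server.

```python
def build_request(host: str, path: str = "/") -> str:
    """Return the text of a minimal HTTP/1.1 GET request message."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, target, protocol version
        f"Host: {host}",         # HTTP/1.1 requires a Host header
        "Connection: close",     # ask the server to close after replying
        "",                      # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines)


def parse_response(message: str) -> tuple[int, dict[str, str], str]:
    """Split a reply message into status code, headers and body.

    Assumes a well-formed status line of the form 'HTTP/1.1 200 OK'.
    """
    head, _, body = message.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers, body


# Hypothetical exchange: what a browser might send, and a reply a server might give.
request = build_request("www.example.org", "/index.html")
sample_reply = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html>hello</html>"
)
status, headers, body = parse_response(sample_reply)
print(request)
print(status, headers["content-type"], body)
```

Both messages share the same shape: a first line, a block of headers, a blank line and an optional body. This shared, standardised format is what allows any browser to communicate with any server.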
All the content available on a web server is identified using a uniform resource locator (URL) – a reference which describes where on the web that content is located (see Chapter 3, Selection, for a more detailed discussion of URLs). The nature of URLs is one of the defining characteristics of the web, and creates a very indirect relationship between browsers and servers. Neither the browser nor the server needs to know anything about the other, beyond the information contained within the HTTP message. Thus, a browser requests content by sending an HTTP request containing the relevant URL.
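As a rough illustration of how a URL alone tells a client where to send its request, the following Python sketch splits a URL into the server to contact and the resource to ask for. The function name `request_target` and the example URLs are assumptions made for illustration; `urlsplit` is part of Python's standard `urllib.parse` module.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes this sketch understands.
DEFAULT_PORTS = {"http": 80, "https": 443}


def request_target(url: str) -> tuple[str, int, str]:
    """Return (host, port, path-with-query) for an http or https URL."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS or parts.hostname is None:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    path = parts.path or "/"          # an empty path means the site root
    if parts.query:
        path += "?" + parts.query     # the query string travels with the path
    return parts.hostname, port, path


# Hypothetical URL used only for illustration.
host, port, path = request_target("http://www.example.org/reports/index.html?lang=en")
print(host, port, path)
```

The host and port identify which server receives the HTTP request, while the path and query are passed to that server inside the request itself. Nothing else about the server needs to be known in advance.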
- Type: Chapter
- Information: Archiving Websites: a practical guide for information management professionals, pp. 42-68. Publisher: Facet. Print publication year: 2006.