Information retrieval (IR) systems are often contrasted with relational databases. Traditionally, IR systems have retrieved information from unstructured text – by which we mean “raw” text without markup. Databases are designed for querying relational data: sets of records that have values for predefined attributes such as employee number, title, and salary. There are fundamental differences between IR and database systems in terms of retrieval model, data structures, and query language, as shown in Table 10.1.
Some highly structured text search problems are most efficiently handled by a relational database. For example, suppose the employee table contains an attribute for short textual job descriptions and you want to find all employees who are involved with invoicing. In this case, the SQL query:
select lastname from employees where job_desc like 'invoic%';
may be sufficient to satisfy your information need with high precision and recall.
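To make the behavior of this query concrete, the following minimal sketch runs it against an in-memory SQLite database through Python's standard sqlite3 module. The table layout, the sample rows, and the helper name find_invoicing_employees are assumptions made for illustration; only the query itself comes from the text. Note that the pattern 'invoic%' is a prefix match, so it finds only job descriptions that begin with "invoic".

```python
import sqlite3


def find_invoicing_employees(rows):
    """Return last names whose job_desc starts with 'invoic' (prefix match).

    `rows` is a list of (lastname, job_desc) tuples; the schema is a
    hypothetical stand-in for the employee table described in the text.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table employees (lastname text, job_desc text)")
        conn.executemany("insert into employees values (?, ?)", rows)
        # Same query as in the text, with an ORDER BY added so the
        # result order is deterministic.
        cursor = conn.execute(
            "select lastname from employees "
            "where job_desc like 'invoic%' order by lastname"
        )
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()


# Hypothetical sample data, invented for this example.
SAMPLE_ROWS = [
    ("Smith", "invoicing and billing"),
    ("Jones", "invoice processing"),
    ("Lee", "payroll"),
    ("Khan", "handles invoicing"),
]

print(find_invoicing_employees(SAMPLE_ROWS))  # ['Jones', 'Smith']
```

In this sample, Khan is not returned even though the description mentions invoicing, because the word does not appear at the start of the attribute. The query is therefore sufficient only when the job descriptions are short and written in a regular form, which is the situation the text assumes.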
However, many structured data sources containing text are best modeled as structured documents rather than relational data. We call the search over such structured documents structured retrieval. Queries in structured retrieval can be either structured or unstructured, but we assume in this chapter that the collection consists only of structured documents. Applications of structured retrieval include digital libraries, patent databases, blogs, text in which entities like persons and locations have been tagged (in a process called named entity tagging), and output from office suites like OpenOffice that save documents as marked-up text.
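The difference between an unstructured and a structured query over a structured document can be sketched in a few lines of Python using the standard xml.etree.ElementTree module. The patent record, its element names, and the two helper functions are hypothetical and chosen only for illustration; substring matching stands in for a real retrieval model.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured document, invented for this example.
PATENT_XML = """
<patent>
  <title>Bicycle brake</title>
  <inventor>Jane Doe</inventor>
  <abstract>A cable-operated brake for a bicycle wheel.</abstract>
  <claims>
    <claim>A brake comprising a cable and a lever.</claim>
  </claims>
</patent>
"""


def unstructured_match(root, term):
    """Unstructured query: does the term occur anywhere in the document text?"""
    text = " ".join(root.itertext()).lower()
    return term.lower() in text


def structured_match(root, path, term):
    """Structured query: does the term occur inside an element selected by path?

    `path` uses the limited XPath subset supported by ElementTree.
    """
    return any(
        term.lower() in "".join(element.itertext()).lower()
        for element in root.findall(path)
    )


root = ET.fromstring(PATENT_XML)

print(unstructured_match(root, "cable"))           # True: occurs somewhere
print(structured_match(root, "title", "cable"))    # False: not in the title
print(structured_match(root, ".//claim", "cable"))  # True: occurs in a claim
```

The unstructured query ignores the markup entirely, whereas the structured query restricts the match to a particular part of the document. A relational table with one text attribute per row offers no natural way to express the second kind of constraint over nested content.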