System evaluation has mattered since research on automatic language and information processing
began. However, the (D)ARPA conferences have raised the stakes substantially in
requiring and delivering systematic evaluations and in sustaining these through long-term
programmes; and it has been claimed that this has both significantly raised task performance,
as defined by appropriate effectiveness measures, and promoted relevant engineering development.
These controlled laboratory evaluations have made very strong assumptions about the
task context. The paper examines these assumptions for six task areas, considers their impact
on evaluation and performance results, and argues that for current tasks of interest, e.g. summarising,
it is now essential to play down the present narrowly defined performance measures
in order to address the task context, and specifically the role of the human participant in the
task, so that new measures of greater value can be developed and applied.