We consider the following problem. For a given population of m items, we have to make a decision whether or not the population includes a relatively large cluster of identical items. This decision affects the effectiveness of a subsequent computational process, depending on the actual existence of the cluster and its size. To make a good decision, we use a statistical sample which should indicate the existence of a cluster and find a representative thereof. This paper describes the optimal sampling technique to be used in such a case, given the cost of the sampling and the potential gain in speed of the subsequent process. The optimal fixed sample size is specified, as well as the optimal sequential sampling, along with characterizing the dependence of the cost function on the truncation point.
For the case that the a priori distribution of the cluster proportion is known, we present formulae by which the optimal sampling procedures can be easily calculated. For the common situation in which the a priori distribution is not known, we present, in the case of a fixed sample size, a tight upper bound for the sample size, which is independent of the a priori distribution, and for the case of the sequential sampling, we present an approximately optimal truncation point, which is also independent of the a priori distribution.
The situation described arose in connection with choosing the best sorting method, an application that will be described in full detail. The most interesting practical result is that for our application truncating the sequential procedure at 35 observations, out of a population of 25,000–30,000 items, guarantees that in our sorting application we are always within 2.1% of the optimal cost independently of the a priori distribution.