
2 - Programming and statistical concepts
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp 20-58
- Chapter
Summary

History
In the late nineteenth and early twentieth centuries, computing devices became more powerful when they were provided with memory in which to store the data used in the computation. This memory could also store the intermediate results of computation for later use. In the middle of the twentieth century, a major breakthrough in the design of computers occurred when scientists realized that the memory of a computing device could also store the computational instructions themselves. This enabled programmers to compose computational instructions that could be executed automatically, without human intervention.
The instructions that a computer can execute directly need to be very detailed; they are tedious and time consuming to compose, difficult to read even for experienced programmers, prone to errors, and hard to correct. In the late 1950s, some people in the IBM Corporation realized that they could write a computer program that could read statements in a language in which a computer programmer could express his or her intent more efficiently, and then translate those statements into the detailed instructions that a computer can execute directly. We call such a translating program a compiler. Before a compiler can be written, the programming language that it translates, i.e., vocabulary, syntax, and grammar, must be explicitly specified. The program that a compiler translates is called the source statements or source code. When a compiler executes to translate source statements it is said to compile the source statements. The result of a compilation is the object code. Object code is another computer program written with the instructions that the computer can execute directly. When a programmer writes a program in a source language, first source statements are composed using an editor, next those source statements are compiled into object code, and finally that object code is executed to carry out the intent of the programmer. We often use the simpler term, run, to mean execute.
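As a rough illustration of the compose, compile, and run sequence described above, here is a minimal Python sketch that uses the built-in functions compile() and exec(). It is only an analogy: Python compiles source statements into bytecode for its interpreter rather than into instructions the hardware executes directly, and the variable names are invented for this example.

```python
# Source statements, composed as ordinary text (the "source code").
source = "total = sum(range(1, 11))"

# Compile the source statements. Python produces a code object (bytecode for
# its interpreter), which plays the role of object code in this analogy.
code_object = compile(source, "<example>", "exec")

# Execute ("run") the compiled code in a fresh namespace.
namespace = {}
exec(code_object, namespace)
print(namespace["total"])
```

A syntax error in the source text would be reported at the compile step, before anything is run.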

References
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp 253-255
- Chapter

Acknowledgments
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp vii-viii
- Chapter

A Computational Approach to Statistical Arguments in Ecology and Evolution

George F. Estabrook
Published online:

05 June 2012

Print publication:

29 September 2011
- Book
Scientists need statistics. Increasingly this is accomplished using computational approaches. Freeing readers from the constraints, mysterious formulas and sophisticated mathematics of classical statistics, this book is ideal for researchers who want to take control of their own statistical arguments. It demonstrates how to use spreadsheet macros to calculate the probability distribution predicted for any statistic by any hypothesis. This enables readers to use anything that can be calculated (or observed) from their data as a test statistic and hypothesize any probabilistic mechanism that can generate data sets similar in structure to the one observed. A wide range of natural examples drawn from ecology, evolution, anthropology, palaeontology and related fields give valuable insights into the application of the described techniques, while complete example macros and useful procedures demonstrate the methods in action and provide starting points for readers to use or modify in their own research.

9 - Dependencies
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp 182-212
- Chapter

6 - Parametric distributions
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp 122-140
- Chapter
Summary

Basic concepts
Parametric distributions are used extensively in classical statistics because mathematicians have manipulated their calculating formulas to devise test statistics whose predicted distributions are among the pre-calculated ones, some of which are listed in the backs of older statistics books. In addition, if a member of a parametric family adequately describes the variation in data of interest, these data can be used to estimate values for parameters; the name of the family together with values for its parameters make a useful summary description of that variation. Parametric distributions can also help define a hypothesized random process when you take a computational approach to statistical argument.
Binary distributions
Consider again a binary random variable b with possible values 1 and 0 and a distribution given by Pr(b = 1) = p. Specified in this way, the random variable, b, is chosen from a large class of random variables that differ with respect to the mechanisms that sample them, but always have only two possible values, 0 and 1. Although two random variables in the class to which b belongs may differ in mechanism, their distributions will be the same if Pr(b = 1) is the same for both. We use the word, family, to refer to a collection of distributions that all have the same form, such as all the distributions for the random variable, b, but differ by the value of a number, such as p, which we call a parameter. Distributions that can be easily specified by designating the family to which they belong and specifying the value of a parameter (or sometimes the values of two or a few parameters), are called parametric distributions. Well-known families have names. The distribution of the random variable, b, belongs to a family named binary distributions.
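A binary random variable with parameter p can be sampled in a few lines. The following is a minimal sketch in Python rather than the spreadsheet macros the book uses; the function names sample_binary and observed_frequency are hypothetical, chosen only for this illustration.

```python
import random


def sample_binary(p, rng):
    """Return 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0


def observed_frequency(p, n, seed=0):
    """Sample the binary random variable n times; return the fraction of 1s."""
    rng = random.Random(seed)  # seeded so that the run can be repeated
    return sum(sample_binary(p, rng) for _ in range(n)) / n
```

Calling observed_frequency(0.3, 100000) should return a value close to 0.3. Changing p selects a different member of the same family.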

8 - Fitting distributions
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp 169-181
- Chapter
Summary

Estimators are random variables
You may choose to hypothesize variability using a random variable with a parametric distribution. Once you have hypothesized an appropriate family of parametric distributions, you still must hypothesize appropriate values for the parameters. If you have observed data, then one approach is to use your hypothesis that the observed data sampled a random variable with some distribution from that parametric family to estimate the parameters. This can be done in a variety of ways. Notice that your data are now assumed to be samples of a random variable. When you do arithmetic with the data to create an estimate of the parameters, such an estimate itself becomes a random variable, described by its distribution.
Hypothesize that your data have been generated by a binary random variable, b, repeatedly sampled independently, but you do not hypothesize a value for p = Pr(b = 1). Instead you guess p from the data that you have observed. By what criteria could you make that guess? What properties might such guesses have?
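One way to see that an estimate is itself a random variable is to simulate many data sets from a known p and look at how the estimates vary. The sketch below assumes the simplest guess, the observed fraction of 1s; the summary does not say which criterion the chapter adopts, and the names estimate_p and estimator_distribution are hypothetical.

```python
import random


def estimate_p(data):
    """Guess p as the observed fraction of 1s (an assumed, simple estimator)."""
    return sum(data) / len(data)


def estimator_distribution(true_p, n, replicates, seed=0):
    """Simulate `replicates` data sets of n binary samples each and return
    the sorted estimates of p, one per simulated data set."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replicates):
        data = [1 if rng.random() < true_p else 0 for _ in range(n)]
        estimates.append(estimate_p(data))
    return sorted(estimates)
```

For example, estimator_distribution(0.4, 50, 2000) returns 2000 estimates scattered around 0.4. Their spread shows how far a guess based on 50 observations might fall from the value that generated the data.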

Index
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp 256-257
- Chapter

1 - Introduction
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp 1-19
- Chapter
Summary

Purpose
The purpose of this book is to teach you how to make statistical arguments using computational approaches. Such arguments are based on test-statistic probability distributions predicted by hypotheses, as in the classical approach. Unlike the classical approach, these predicted probability distributions are calculated by direct simulation of the hypotheses themselves. This approach enables you to use anything that can be calculated (or observed) from your data as a test statistic, and hypothesize any probabilistic mechanism that can generate data sets similar in structure to the one you observed. This approach frees you from the constraints, mysterious formulas, and sophisticated mathematics that classical statistics entails, and enables you to take personal control of your statistical arguments. To access this power, you will need to learn to program a computer (if you do not already know how). This task is greatly simplified through the use of spreadsheet macros, which enable the organization and input of data, as well as the output of results, using the spreadsheet itself. Many of you are already familiar with spreadsheets. The macros you will need to program will serve mostly to perform calculations, so that you will need to learn only a small sub-set of the programming language. In this book, I discuss basic hypothesis-testing statistical argument, data structures, choice of test statistics, some probability theory and its use in formulating hypotheses, and enough programming techniques to specify the calculations that simulate data sets using probabilistic hypotheses. Much of the discourse is with natural examples. Although this computational approach to statistical argument is widely applicable, these examples are mostly drawn from anthropology, ecology, evolution, and paleontology, which are my areas of interest, and those of most of my students.
Intended readers
This book is intended for readers who aspire to become, are already becoming, or who have become, research scientists who would like to feel more in control of their statistical arguments. It does not expect you to have prior training in statistics or computer programming, but some of my students who have had such training found this book valuable because it provided a very different view of these subjects, especially of statistics. Earlier versions of this book have existed in unpublished form for the past decade. I used them to teach my course on computational approaches to statistical argument at the University of Michigan. My students have been mostly Ph.D. students (about half in biological anthropology, and the others in paleontology, ecology, and evolution, with a few other areas occasionally represented). However, a few masters students, undergraduates planning to attend graduate school, post-doctoral fellows, and even fellow faculty have also participated.

4 - Random variables and distributions
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp 77-100
- Chapter
Summary

At random
We use the term, at random, in everyday speech, where it has a vague, intuitive meaning, but in order to use the term, at random, to make statistical arguments explicit, it must be clearly defined. There are philosophical debates and different schools of thought to define what, at random, should mean, especially as it applies to natural phenomena whose behavior we cannot predict with much precision from known deterministic causes. Although we may not be able to predict measurable values with much precision, usually there are some constraints and some patterns. If some quantitative aspect of a phenomenon is observed many different times, its values may tend to fall in a given range with an approximate frequency. One school of thought, called the frequentists, defines the term, at random, in this way: something observed under specified natural conditions varies at random if, when large numbers of values are observed independently, they tend to fall in specified ranges with consistent frequencies. I will not discuss other schools of thought here, but for purposes of reading this text, you can consider yourself a frequentist.
The frequentist concept can be idealized to enable us to say, in some cases, what we think those frequencies should be. For example, flipping a coin results in two possible observations: heads or tails. We say that the coin is fair if the frequency of each of those two values is approximately 1/2 when the coin is flipped independently a large number of times. For our purposes generally, the words, observations are made independently, mean that if we observe the value of a flip, it does not affect the frequencies of the values of subsequent flips. More generally, “observations are made independently” means that if we know the observed value of an instance of a random process, then it does not change the frequencies with which possible values will be observed in a subsequent instance.
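The fair-coin idealization and the meaning of independence can both be checked by simulation. In this minimal sketch (Python, with hypothetical function names), the first function reports the overall frequency of heads, and the second reports the frequency of heads among only those flips that immediately follow a head. If flips are independent, the two frequencies should agree.

```python
import random


def heads_frequency(n_flips, seed=0):
    """Fraction of heads in n_flips simulated flips of a fair coin."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


def heads_frequency_after_heads(n_flips, seed=0):
    """Fraction of heads among the flips that immediately follow a head."""
    rng = random.Random(seed)
    flips = [rng.random() < 0.5 for _ in range(n_flips)]
    followers = [nxt for prev, nxt in zip(flips, flips[1:]) if prev]
    return sum(followers) / len(followers)
```

With a large number of flips, say 200000, both functions should return values close to 1/2.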

Contents
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp v-vi
- Chapter

11 - Contingency
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp 220-252
- Chapter
Summary

Examples
Birds of a feather nest together, or do they? A pond has three islands where three different species of birds nest. Bird watchers observed the 5 nests on one island, the 8 nests on another island, and the 12 nests on a third island. Of these 25 nests, 10 belonged to birds of species A, 8 to birds of species B, and 7 to birds of species C. Do birds of the same species tend to nest on the same island?
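The nesting question can be approached by simulation. The sketch below is one possible setup, not necessarily the one developed in the chapter. It assumes the hypothesis that species are assigned to nests at random, and it assumes a test statistic of my own choosing: the number of pairs of nests that share both an island and a species. Only the island and species totals come from the text; the observed arrangement of species across islands is not given in the summary, so the code produces only the predicted distribution.

```python
import random
from itertools import combinations

ISLAND_SIZES = [5, 8, 12]                    # nests per island, from the text
SPECIES_COUNTS = {"A": 10, "B": 8, "C": 7}   # nests per species, from the text


def same_species_pairs(species_by_nest, island_by_nest):
    """Count pairs of nests on the same island built by the same species."""
    return sum(
        1
        for i, j in combinations(range(len(species_by_nest)), 2)
        if island_by_nest[i] == island_by_nest[j]
        and species_by_nest[i] == species_by_nest[j]
    )


def simulate_null(replicates, seed=0):
    """Shuffle species labels over the 25 nests (assumed hypothesis) and
    return the sorted values of the statistic, one per shuffle."""
    rng = random.Random(seed)
    islands = [k for k, size in enumerate(ISLAND_SIZES) for _ in range(size)]
    species = [s for s, count in SPECIES_COUNTS.items() for _ in range(count)]
    values = []
    for _ in range(replicates):
        rng.shuffle(species)
        values.append(same_species_pairs(species, islands))
    return sorted(values)
```

An observed count that falls far in the upper tail of simulate_null(...) would suggest that birds of the same species tend to share islands.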
I study culturally informed technology in the context of traditional Portuguese agriculture. As recently as the middle of the twentieth century, in many of the agricultural villages in the mountainous interior of the Beira Alta region of Portugal, virtually all the land was still owned by a few wealthy families. They rented large parcels to farming families, on a more or less permanent basis. These renters owned their own traction animals and equipment and managed their farm, but hired members of the many remaining families to work for them, on a daily basis, to perform much of the farm labor. These day-workers could work for any farmer who would hire them, usually in their own village, but sometimes also in other villages. I wondered if day-workers in the same family tended to work for the same farmer, or for any farmer, or avoided working for the same farmer as other family members.

5 - More programming and statistical concepts
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp 101-121
- Chapter
Summary

A question
I went fly fishing with my friend John. We caught 14 fish. He was casting a Blue Bobber and caught six fish, and I was casting a Grimacing Willy and caught eight fish. I wondered, “Do Blue Bobbers tend to catch the same size fish as Grimacing Willies?”
To answer this question using the methods I have presented, I need to: (1) use as data the 14 fish we caught, measured by weight and structured into two groups based on which fly caught them; (2) use as a test statistic a number, calculated from the data, that sums up how much heavier are the fish caught on Blue Bobbers than the fish caught on Grimacing Willies; (3) hypothesize a specific probability mechanism that represents the hypothesis that weights of fish caught on Blue Bobbers are not different from weights of fish caught on Grimacing Willies; (4) put that mechanism in motion to compute a large number of data sets, each similar in structure to the one observed, but constituting a sample of the hypothesis; (5) from each data set calculate a value of the test statistic; (6) sort these values and report them as an estimate of the probability distribution predicted by the hypothesis that fish weights are not different; (7) see where in this predicted distribution the observed weight difference falls; and finally (8) decide whether the observed data seem to be consistent with the hypothesis.
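The eight steps above can be sketched in code. This is a minimal illustration in Python, not the author's spreadsheet macro. It assumes a difference of mean weights as the test statistic for step (2), and for step (3) it assumes that the "no difference" hypothesis is represented by shuffling the pooled weights between the two flies. The summary does not specify either choice, and all function names are hypothetical. The fish weights are not given in the summary, so they are left as arguments.

```python
import random


def mean(values):
    return sum(values) / len(values)


def mean_difference(bobber_weights, willy_weights):
    """Step 2: the test statistic (assumed here to be a difference of means)."""
    return mean(bobber_weights) - mean(willy_weights)


def predicted_distribution(bobber_weights, willy_weights, replicates, seed=0):
    """Steps 3 to 6: shuffle the pooled weights between the two groups
    (an assumed mechanism for 'no difference'), recompute the statistic
    for each simulated data set, and return the sorted values."""
    rng = random.Random(seed)
    pooled = list(bobber_weights) + list(willy_weights)
    n_bobber = len(bobber_weights)
    values = []
    for _ in range(replicates):
        rng.shuffle(pooled)
        values.append(mean_difference(pooled[:n_bobber], pooled[n_bobber:]))
    return sorted(values)


def fraction_at_least(distribution, observed):
    """Step 7: fraction of simulated values at least as large as observed."""
    return sum(v >= observed for v in distribution) / len(distribution)
```

With the six Blue Bobber weights and the eight Grimacing Willy weights in hand, you would compute the observed mean_difference, build the predicted_distribution with a large number of replicates, and use fraction_at_least to judge, in step (8), whether the observed value sits in an unremarkable part of the distribution.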

Frontmatter
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp i-iv
- Chapter

3 - Choosing a test statistic
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp 59-76
- Chapter
Summary

Data from fossil marine organisms
Raup and Sepkoski (1984a) published a study of the fossils of several hundred marine organisms that lived between about 300 million years ago and about 10 million years ago. They divided this time period into 43 stages of about seven and a half million years each. In each stage they counted the number of taxonomic families that made their last appearance in that stage (presuming that these families went extinct during that stage). Then they compared each pair of successive stages to determine if, from the earlier stage to its next later stage, the number of extinctions went up (U) or down (D). These are their observations:
UU DU DDDU DUUU DU DU DUU DDU DDUU DDDUU DDUUU DDUU DD
Each letter stands for a pair of successive stages. With 43 stages there are 42 pairs of successive stages, and hence 42 letters. An extinction peak occurred whenever a U was followed by a D. Observe that the lengths of the times, in units of stages, of inter-peak intervals are 2, 4, 4, 2, 2, 3, 3, 4, 5, 5, 4. The initial UU is not counted as an inter-peak interval because you cannot be sure when it began. Likewise, the final DD is not counted as an inter-peak interval because you cannot be sure when it ended. The average length of inter-peak interval is about 3 1/2 stages, which in units of years is about 26 MY. One group of scientists suggested that during this 290 MY period, the extinction rate of marine families had risen and fallen periodically, with a period of about 26 MY.
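The inter-peak intervals can be recovered mechanically from the sequence of letters. The following minimal Python sketch (the function name is hypothetical) marks a peak wherever a U is immediately followed by a D and measures the distance between successive peaks. The stage length of 7.5 MY is the approximate figure quoted in the text.

```python
# Observed sequence, copied from the text (the spaces are only visual grouping).
OBSERVED = "UU DU DDDU DUUU DU DU DUU DDU DDUU DDDUU DDUUU DDUU DD"


def inter_peak_intervals(sequence):
    """Return the lengths, in stages, between successive extinction peaks.
    A peak is a U immediately followed by a D."""
    letters = sequence.replace(" ", "")
    peaks = [i for i in range(len(letters) - 1)
             if letters[i] == "U" and letters[i + 1] == "D"]
    return [later - earlier for earlier, later in zip(peaks, peaks[1:])]


intervals = inter_peak_intervals(OBSERVED)
mean_stages = sum(intervals) / len(intervals)
print(intervals)          # [2, 4, 4, 2, 2, 3, 3, 4, 5, 5, 4]
print(mean_stages)        # mean interval length in stages
print(mean_stages * 7.5)  # the same mean in millions of years
```

The eleven intervals sum to 38 stages, so the mean is 38/11, or about 3.45 stages, in line with the "about 3 1/2" quoted above.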
The controversy
A second group of scientists questioned whether the data support this hypothesis. They observed that half the pairs of successive stages showed an increase in extinction rate and half showed a decrease (there are 21 Us and 21 Ds) and so they hypothesized that with probability 1/2 the next successive stage shows an increase in extinction rate, and with probability 1/2 it shows a decrease. This hypothesis implies a non-periodic random process. They showed mathematically that this hypothesis predicts that the average inter-peak interval length approaches 4.0 as the sequence of Us and Ds becomes very long. They argued that this is close enough to the observed average of 3 1/2 to interpret the data as consistent with this hypothesized non-periodic random process.
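The prediction that the average inter-peak interval approaches 4.0 can be checked by simulating the hypothesized process directly. This is a minimal sketch with hypothetical function names; it generates a long sequence in which each letter is U or D with probability 1/2, independently, and averages the distances between successive peaks.

```python
import random


def inter_peak_intervals(letters):
    """Distances between successive peaks, where a peak is a U followed by a D."""
    peaks = [i for i in range(len(letters) - 1)
             if letters[i] == "U" and letters[i + 1] == "D"]
    return [later - earlier for earlier, later in zip(peaks, peaks[1:])]


def mean_interval_for_random_sequence(length, seed=0):
    """Mean inter-peak interval of one random sequence of U and D letters."""
    rng = random.Random(seed)
    letters = [rng.choice("UD") for _ in range(length)]
    intervals = inter_peak_intervals(letters)
    return sum(intervals) / len(intervals)
```

For a sequence of, say, 200000 letters the result should be close to 4.0. Note that this checks only the long-run average. A sequence of just 42 letters yields averages that vary from run to run, and a very short sequence may contain fewer than two peaks, in which case this sketch would fail with a division by zero.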

10 - How to get away with peeking at data
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp 213-219
- Chapter
Summary

Examples
Paulo watched two male Mus domesticus (house mouse) interacting to establish dominance. For each 3-second interval, he recorded + if a mouse performed a particular behavior, such as grooming himself, or – if that behavior was not performed. This sequence was recorded for one of the behaviors of the finally dominant mouse.
– – – + – – + – + – – – – + – + – – + – + + – – + – – + + – + – –
Paulo wondered if the frequency with which the mouse performed this behavior changed at some point in the interaction; perhaps at the time he began to assume dominance. So Paulo looked through the data sequence for the time when the difference between the frequency before and the frequency after is the greatest. Then he used the hypothesis that the behavior is performed independently with the same probability for each interval to test if the frequency with which the behavior is performed before this time is different from that after this time. Somehow, he was not surprised that these frequencies turned out to be significantly different. After all, he peeked at the sequence to divide it at the point of maximal difference.
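One plausible way to account for the peeking, offered here as a sketch and not necessarily the procedure the chapter goes on to describe, is to make the peeking part of the test statistic. Take the largest before-and-after frequency difference over all split points, then apply exactly the same search to every data set simulated under the hypothesis. The assumptions in this Python sketch are mine: the probability p is set to the observed frequency of +, splits must leave at least min_len intervals on each side, and all names are hypothetical. The sequence is transcribed from the text with - standing for –.

```python
import random

# Observed sequence from the text, transcribed without spaces.
OBSERVED = "---+--+-+----+-+--+-++--+--++-+--"


def max_split_difference(events, min_len=5):
    """Largest |frequency before - frequency after| over the allowed splits."""
    best = 0.0
    for k in range(min_len, len(events) - min_len + 1):
        before, after = events[:k], events[k:]
        diff = abs(sum(before) / len(before) - sum(after) / len(after))
        best = max(best, diff)
    return best


def peeking_distribution(n, p, replicates, seed=0, min_len=5):
    """Simulate sequences of n independent intervals with constant probability p,
    apply the same split search to each, and return the sorted statistics."""
    rng = random.Random(seed)
    values = []
    for _ in range(replicates):
        events = [1 if rng.random() < p else 0 for _ in range(n)]
        values.append(max_split_difference(events, min_len))
    return sorted(values)


observed_events = [1 if c == "+" else 0 for c in OBSERVED]
observed_stat = max_split_difference(observed_events)
p_hat = sum(observed_events) / len(observed_events)  # assumed value for p
null_values = peeking_distribution(len(observed_events), p_hat, 2000)
fraction = sum(v >= observed_stat for v in null_values) / len(null_values)
print(observed_stat, fraction)
```

Because every simulated sequence is searched for its own point of maximal difference, the predicted distribution already includes the advantage gained by peeking, and the observed maximum is compared with like.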
Jennifer and her colleagues observed fossil Forams (little shelled animals that lived at the bottom of the sea hundreds of millions of years ago) that had changed size through evolutionary time. Looking at a plot of size vs time before present, Forams seemed to be getting bigger. In addition, the plot of points seemed to bend at a particular time in the past, suggesting that the rate of increase after that time was more rapid. Had the evolutionary rate changed at that time? So they used the computational ANCOVA technique, described in Section 9.3, to test for equality of rate before and after that time. Somehow, they were not surprised that the rates were different. After all, they had peeked at the data to find the time before and after which rate difference was greatest.

7 - Linear model
George F. Estabrook, University of Michigan, Ann Arbor
Book:

A Computational Approach to Statistical Arguments in Ecology and Evolution

Published online:

05 June 2012

Print publication:

29 September 2011, pp 141-168
- Chapter
Summary

Linear model
Fundamental to observing phenomena, hypothesizing explanations, and arguing their differential credibility is the recognition and quantification of distinctions. Are things different? If so, how different are they? The question “Can we distinguish by their weight the two groups of fish caught on different dry flies?” is an example. We have discussed several test statistics relevant to this question. Now we will examine an approach used by classical statistics to address the same basic question. This approach to hypothesis formation is widely used in the published literature of every natural science, and is not uncommon in the more quantitative publications of social science as well. It will be important for you to understand it, and possibly even use it, in your own work.
This approach is to hypothesize a non-probabilistic causal mechanism; any variation in the observed data that is not accounted for by this mechanism is attributed to error. This error is construed as random and is quantified in a particular way that was convenient when people had to compute without computers. These non-probabilistic hypotheses are actually a whole family of such hypotheses, called the linear model. From this family, the particular member that minimizes the random (unexplained) variability in the data attributed to error is chosen. The result is a description of the variability in your data, and several candidates for test statistics.
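To make "the member that minimizes the unexplained variability" concrete, the following minimal sketch fits a straight line y = intercept + slope * x. It assumes the conventional least-squares criterion, in which error is quantified as the sum of squared deviations from the line; the summary does not spell out the criterion, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the line that minimizes the sum of
    squared residuals (the assumed measure of unexplained variability)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_sum_of_squares(x, y, intercept, slope):
    """Variability in y left unexplained by the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
```

If x is coded 0 for one group and 1 for the other, as it could be for fish caught on two different flies, the fitted intercept is the mean of the first group and the slope is the difference between the two group means.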
