Example exploration: Old Faithful
R.W. Oldford
Old Faithful
In Yellowstone National Park, Wyoming, USA, there is a famous geyser called “Old Faithful” which erupts with some regularity. The physical model is thought to be something like the following illustration (from Rinehart (1969), p. 572, via Azzalini and Bowman (1990)):
Old Faithful: stages
Azzalini and Bowman (1990) describe the stages as:

Stage 1. “The tube is full of water which is heated by the surrounding rocks. The water is heated above the normal boiling temperature because of increased pressure. This is due to the mass of water which is on top: the deeper the water the higher the temperature required for boiling. Moreover, ‘whereas the water in the tube is superheated with respect to the ambient boiling point at the mouth of the geyser, the water temperature at depth is far below the boiling point curve that must be applied to a vertical column of water’.”

Stage 2. “When the top water reaches the boiling temperature, it becomes steam and moves towards the surface. The pressure at the bottom then drops rapidly to the normal level and, by an induction effect, the bottom water rapidly becomes steam. This cascading mechanism is repeated several times: as water is converted into steam, the pressure on lower water is decreased, causing the production of more steam and triggering the eruption.”

Stage 3a. “‘If at the time of cascading the temperature in the lower regions is lower than might be expected, cascading stops short of the bottom and the play is short.’”

Stage 3b. “Alternatively, ‘when the temperature is comparatively high at these depths, cascading works itself down much farther and the play is long’.”

Stage 4. “The geyser tube is completely or partly empty, ready to be filled with new water.”

“We do not discuss geological reasons for the fact that sometimes the cascading effect works down to the bottom of the tube while at other times it stops earlier. We simply note the phenomenon and discuss its consequences. Stages 3a and 3b are associated with short and long waiting times for the next eruption. In stage 3a, the system starts a new cycle partially filled with hot water so that the following heating time is shorter; at the new eruption the entire tube will be emptied, since part of the water had already been heated in the previous cycle.”
Old Faithful: Data collection
For each eruption, the waiting time w between its beginning and the beginning of the previous eruption is recorded to the nearest minute and the duration d of the eruption is recorded to fractions of a minute.
Collected from August 1st until August 15th, 1985, the data record the 299 successive eruptions which occurred during this time. Though R. A. Hutchinson, the park geologist, collected similar data sets, it is not clear from the source whether or not this data set is one of them. Measurements had to be taken through the night, and duration times for the night-time eruptions were recorded only as one of short, medium, or long (encoded here as 2, 3, or 4 minutes, respectively).
Questions
1. What might the scientific investigators have in mind for a target population/process?
2. What might be the study population/process available to the scientific investigators? Why might there be study error?
3. What is the sample in this case? Why might there be sample error?
4. Imagine the process for selecting a sample. How might this process produce sampling bias?
5. Imagine the measuring process. What problem(s) do you think might be associated with the measuring process? How might it manifest itself in terms of measuring bias and/or variability?
Old Faithful: The data
The data are available to us as a data.frame in R via the package MASS.
data(geyser, package = "MASS")
nrow(geyser)
## [1] 299
names(geyser)
## [1] "waiting"  "duration"
geyser[1:2,]
##   waiting duration
## 1      80 4.016667
## 2      71 2.150000
summary(geyser)
##     waiting          duration
##  Min.   : 43.00   Min.   :0.8333
##  1st Qu.: 59.00   1st Qu.:2.0000
##  Median : 76.00   Median :4.0000
##  Mean   : 72.31   Mean   :3.4608
##  3rd Qu.: 83.00   3rd Qu.:4.3833
##  Max.   :108.00   Max.   :5.4500
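The coding of night-time durations as 2, 3, or 4 minutes can be checked directly in the data. The following is a minimal sketch, not part of the original slides, that counts how many recorded durations fall exactly on those three values. It assumes that an exact match signals a coded value, although a daytime eruption timed at exactly 2, 3, or 4 minutes would be counted as well. The names `coded_values`, `is_coded`, `coded_counts`, and `coded_share` are illustrative.

```r
# Sketch: how many durations sit exactly on the coded values 2, 3 and 4?
data(geyser, package = "MASS")

coded_values <- c(2, 3, 4)   # short, medium, long, as encoded in the data
is_coded <- geyser$duration %in% coded_values

# Counts at each coded value, and the share of all eruptions they make up
coded_counts <- table(geyser$duration[is_coded])
coded_share <- mean(is_coded)

coded_counts
coded_share
```

A large share of exact matches would suggest that the coarse night-time coding has a visible effect on any display of the durations.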
Old Faithful: The data exploration
Could look at each measurement to see how the values appear to be distributed. For example, simply sorted:
# the sorted values
plot(sort(geyser$duration),
     xlab = "Index: smallest to largest", ylab = "duration")

[Figure: sorted duration values plotted against index, smallest to largest]

What is this plot called? What do you learn? Repeat the above for the waiting times.
Old Faithful: The data exploration
Could look at each measurement to see how the values appear to be distributed. For example, by a histogram:
hist(geyser$duration, xlab = "duration", col = "grey", main = "Old Faithful")
[Figure: histogram titled "Old Faithful"; x-axis duration, y-axis Frequency]

What do you learn? How does this connect with the previous plot? Repeat the above for the waiting times.
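What a histogram shows depends on the bins. As a hedged sketch (the quarter-minute bin width and the name `bin_edges` are assumptions, not choices made in the slides), the same durations can be redrawn with narrower bins to see whether the picture changes:

```r
data(geyser, package = "MASS")

# Assumed bin edges: quarter-minute bins covering the observed range
bin_edges <- seq(0.5, 5.5, by = 0.25)

fine_hist <- hist(geyser$duration, breaks = bin_edges,
                  xlab = "duration", col = "grey",
                  main = "Old Faithful (quarter-minute bins)")

# hist() returns the bin counts invisibly; together they cover every eruption
sum(fine_hist$counts)
```

Comparing this with the default histogram is one way to judge how much of the shape comes from the data and how much from the binning.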
Old Faithful: The data exploration
Could look at each measurement to see how the values appear to be distributed. For example, by a density estimate:
# the density
plot(density(geyser$duration), xlab = "duration", main = "Old Faithful")
# fill it
polygon(density(geyser$duration), col = "grey")

[Figure: kernel density estimate titled "Old Faithful"; x-axis duration, y-axis Density]

What do you learn? How does this connect with the previous plot?
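A density estimate depends on its bandwidth in much the same way a histogram depends on its bins. The sketch below, which is an illustration rather than part of the original slides, overlays estimates at half and at double the default bandwidth using the `adjust` argument of `density()`. The multipliers 0.5 and 2 and the colours are arbitrary assumptions.

```r
data(geyser, package = "MASS")

# Default bandwidth, then half and double that bandwidth (assumed multipliers)
dens_default <- density(geyser$duration)
dens_narrow  <- density(geyser$duration, adjust = 0.5)
dens_wide    <- density(geyser$duration, adjust = 2)

# Draw the narrowest first, since it has the tallest peaks
plot(dens_narrow, xlab = "duration", main = "Old Faithful")
lines(dens_default, col = "steelblue")
lines(dens_wide, col = "firebrick")
legend("top", legend = c("adjust = 0.5", "default", "adjust = 2"),
       col = c("black", "steelblue", "firebrick"), lty = 1, bty = "n")
```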
Old Faithful: The data exploration
For eruption $i$,
◮ $d_i$ denotes its duration in minutes and
◮ $w_i$ the time between its beginning and the beginning of the previous eruption.
Questions
1. Why might a plot of the pairs $(w_i, d_i)$ be of interest? What do you notice?
plot(geyser)

[Figure: scatterplot of duration against waiting]

2. What other pairs might be of interest?
Old Faithful: The data exploration
3. Why might each of the following be of interest?
◮ $(d_i, w_{i+1})$
◮ $(d_{i-1}, d_i)$
◮ $(w_{i-1}, w_i)$
◮ $(i, w_i)$
◮ $(i, d_i)$
4. How do we plot each of these?
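Pairs that involve neighbouring eruptions can be built with negative indexing in R: dropping the last element of a vector gives the "earlier" values and dropping the first gives the "later" ones. A minimal sketch on a toy vector (the values and the names `earlier` and `later` are made up for illustration):

```r
# Toy sequence standing in for any measurement taken in time order
x <- c(10, 20, 30, 40)
n <- length(x)

earlier <- x[-n]  # drops the last value:  x_1, ..., x_{n-1}
later   <- x[-1]  # drops the first value: x_2, ..., x_n

# Row i pairs the i-th value with the one that follows it
cbind(earlier, later)
```

Both vectors have length n - 1, so they can be passed directly as the x and y arguments of plot().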
Old Faithful: The data exploration
$(d_i, w_{i+1})$
n <- nrow(geyser)
with(geyser,
     plot(x = duration[-n], y = waiting[-1],
          xlab = "duration", ylab = "following waiting time",
          main = "Old Faithful")
)

[Figure: scatterplot titled "Old Faithful"; x-axis duration, y-axis following waiting time]
What do you learn here?
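One way to put a number on what the plot of duration against the following waiting time suggests is to compare the average following wait after short and after long eruptions. This is only a sketch: the 3-minute cutoff separating "short" from "long" play is an assumption made here for illustration, as are the names `cutoff`, `play`, and `mean_wait_by_play`.

```r
data(geyser, package = "MASS")
n <- nrow(geyser)

d      <- geyser$duration[-n]  # duration of eruption i
w_next <- geyser$waiting[-1]   # waiting time leading up to eruption i + 1

# Assumed cutoff: eruptions under 3 minutes are labelled "short", the rest "long"
cutoff <- 3
play <- ifelse(d < cutoff, "short", "long")

# Average following waiting time within each group
mean_wait_by_play <- tapply(w_next, play, mean)
mean_wait_by_play
```

The two group means can then be set against the description of stages 3a and 3b given by Azzalini and Bowman (1990).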
Old Faithful: The data exploration
$(d_{i-1}, d_i)$
n <- nrow(geyser)
with(geyser,
     plot(x = duration[-n], y = duration[-1],
          xlab = "duration", ylab = "following duration",
          main = "Old Faithful")
)

[Figure: scatterplot titled "Old Faithful"; x-axis duration, y-axis following duration]
What do you learn here?
Old Faithful: The data exploration
$(w_{i-1}, w_i)$
n <- nrow(geyser)
# x is the earlier waiting time, y the one that follows it
with(geyser,
     plot(x = waiting[-n], y = waiting[-1],
          xlab = "waiting time", ylab = "following waiting time",
          main = "Old Faithful")
)

[Figure: scatterplot titled "Old Faithful"; x-axis waiting time, y-axis following waiting time]
What do you learn here?
Old Faithful: The data exploration
$(i, w_i)$
with(geyser,
     plot(x = waiting, type = "l",
          xlab = "index", ylab = "waiting time",
          main = "Old Faithful")
)

[Figure: line plot titled "Old Faithful"; x-axis index, y-axis waiting time]
What do you learn here?
Old Faithful: The data exploration
$(i, d_i)$
with(geyser,
     plot(x = duration, type = "b",
          xlab = "index", ylab = "duration",
          main = "Old Faithful")
)

[Figure: points joined by lines, titled "Old Faithful"; x-axis index, y-axis duration]
Why was type “b” used? What do you learn here?
Old Faithful: The data exploration
$(i, d_i)$ points only
with(geyser,
     plot(x = duration, type = "p",
          xlab = "index", ylab = "duration",
          main = "Old Faithful")
)

[Figure: point plot titled "Old Faithful"; x-axis index, y-axis duration]
What do you learn here?
Old Faithful: The data exploration
We could add a “smooth” function estimating the mean response to any of the above plots. For example, $(d_{i-1}, d_i)$:
n <- nrow(geyser)
x <- geyser$duration[-n]
y <- geyser$duration[-1]
smoothfit <- loess(y ~ x)
xvals <- extendrange(x)
xvals <- seq(min(xvals), max(xvals), length.out = 400)
predictions <- predict(smoothfit, newdata = data.frame(x = xvals))
plot(x, y, xlab = "duration", ylab = "following duration",
     main = "Old Faithful")
lines(xvals, predictions, col = "steelblue")

[Figure: scatterplot of following duration against duration, titled "Old Faithful", with the loess smooth overlaid]
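Because the same steps apply to any pair of variables, they can be wrapped in a small function. The helper below, `plot_with_smooth()`, is hypothetical and not part of the original slides. It is a sketch that fits `loess()` with its default settings and, unlike the code above, evaluates the fit only inside the observed range of x, since `loess()` does not extrapolate by default and returns NA outside that range.

```r
data(geyser, package = "MASS")

# Hypothetical helper (not from the slides): scatterplot plus a loess curve
plot_with_smooth <- function(x, y, n_grid = 400, ...) {
  smoothfit <- loess(y ~ x)
  # Evaluate only inside the observed range of x
  xvals <- seq(min(x), max(x), length.out = n_grid)
  predictions <- predict(smoothfit, newdata = data.frame(x = xvals))
  plot(x, y, ...)
  lines(xvals, predictions, col = "steelblue")
  # Return the fitted curve invisibly in case it is needed later
  invisible(data.frame(x = xvals, fit = predictions))
}

# Example use: duration against the following waiting time
n <- nrow(geyser)
smooth_dw <- plot_with_smooth(x = geyser$duration[-n], y = geyser$waiting[-1],
                              xlab = "duration",
                              ylab = "following waiting time",
                              main = "Old Faithful")
```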
Old Faithful: The data exploration
Now do the same for the waiting times: add a “smooth” function estimating the mean response.
n <- nrow(geyser)
x <- geyser$waiting[-n]
y <- geyser$waiting[-1]
smoothfit <- loess(y ~ x)
xvals <- extendrange(x)
xvals <- seq(min(xvals), max(xvals), length.out = 400)
predictions <- predict(smoothfit, newdata = data.frame(x = xvals))
# axis labels name the waiting times being plotted
plot(x, y, xlab = "waiting time", ylab = "following waiting time",
     main = "Old Faithful")
lines(xvals, predictions, col = "steelblue")

[Figure: scatterplot of following waiting time against waiting time, titled "Old Faithful", with the loess smooth overlaid]
Old Faithful: The data exploration
Or we could add a two-dimensional density estimate.
library(MASS) # contains the kde2d() function
with(geyser, {
  plot(x = waiting, y = duration, main = "Old Faithful")
  dens <- kde2d(x = waiting, y = duration)
  contour(dens, col = "steelblue", add = TRUE)
})

[Figure: scatterplot of duration against waiting, titled "Old Faithful", with labelled density contours overlaid]
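The object returned by `kde2d()` can also be inspected and drawn in other ways. As a sketch under assumed settings (a 50 by 50 evaluation grid instead of the default 25 by 25, and an image display under the contours), not something shown in the original slides:

```r
library(MASS)  # provides kde2d() and the geyser data

# n = 50 grid points per axis is an assumed choice; kde2d() defaults to 25
dens <- with(geyser, kde2d(x = waiting, y = duration, n = 50))

# dens is a list: grid coordinates x and y, and a matrix z of density heights
dim(dens$z)

# Shaded image of the density with contours and the raw points on top
image(dens, xlab = "waiting", ylab = "duration", main = "Old Faithful")
contour(dens, col = "steelblue", add = TRUE)
with(geyser, points(waiting, duration, pch = 20, cex = 0.5))
```

A finer grid gives smoother contours at the cost of more computation. The bandwidths are left at the `kde2d()` defaults here.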