# Why simulate data?

While it is true that nothing beats working with a real dataset that represents the actual forest conditions we are interested in answering questions about, it is often really handy to build fake datasets instead. Sometimes you go through the trouble of identifying, obtaining, importing, and cleaning a real dataset, only to find that it doesn’t actually have the characteristics you are were hoping for.

Therefore as a starting point, it is often helpful (and faster) to generate some fake data that might have been produced from a certain kind of sample design or forest condition that you are interested in modeling or making inference about. This happened to us the other day, and we wound up simulating some data in a few different ways. I thought I would share one of those approaches.

# Start with a problem

When you are going to simulate some data it pays to do a little thinking before you even crack open your Jupyter kernel.

The way you approach simulation will by necessity begin with the specific kind of questions you are trying to answer. A well-defined problem will allow your simulated data to be *parsimonious* - that is it will include all the features you are interested in asking questions about, but not include those that are irrelevant. To pull out the old chestnut from Einstein: “Everything should be made as simple as possible, but no simpler.”

The problem that we were trying to address the other day has to do with the behavior of estimators under unequal probability sampling in a two-stage sample, specifically two-stage sampling with probability proportional to size (PPS). This can come about in forestry in a few different ways, but here is a very common scenario:

- We are interested in measuring some characteristic of a forest (in this example I’ll use basal area)
- The forest is divided into stands, which we will measure by installing plots.
- We aren’t going to sample all of the stands
- We’d like to preferentially sample the larger stands.

Note that for now I’m going to stay away from the question of how you make inference from a sample generated in this way (though we may post about that in the future). Instead the focus of this post is how you might go about generating some fake data that would allow you to begin to think about it.

# Shape of the data

Now that we have a well-defined problem, it is good to think about what the data themselves might look like. What are the features of the data that we need in order to think about this problem? Or, put another way, what might a minimal dataset look like if we were to actually carry out the sample we are trying to simulate?

In the case of the two-stage PPS sample problem at hand, a minimal dataset would contain four neccesary fields: stand id, stand size, plot id, and basal area measurement. We really don’t need anything else, though a different or slightly modified question might require us to add additional detail. For now, though, that’s enough, so let’s mount up and make some fake forests!

```
library(tidyverse)
library(gganimate)
theme_set(theme_bw())
```

# Stand Size

The forest is made of stands that cover a range of conditions. Before we start thinking about how these vary among stands, it makes sense to consider what we need to know for any single stand: - How big is it? - What are possible basal area measurements that could be collected from it?

The way that we answer these questions is by deciding on an appropriate probability distribution we will draw from to produce our data. This is the meat of making good simulated data - intelligently selecting distributions that mimic ‘real world’ data you could collect in the field. To help with this, it is handy to have some familiarity with some common probability distributions and the parameters that control their shape. A good guide to check out is Gavin E. Crooks’ Field Guide To Probability Distributions. I also recommend pulling out the `stats`

distribution tools and plotting some densities to get a sense of what they feel like.

For stand size, I chose to draw from a non-standard beta distribution with upper and lower bounds 200 and 10. A regular beta distribution has support on the interval \([0, 1]\). For non-standard beta, if a variable \(Z\) is beta-distributed, then \(X = Z (u - l) + u\) where \(u\) and \(l\) are the upper and lower bounds of the support. The choice of the non-standard beta comes from my domain knowledge about stand sizes in many types of forest ownerships - management units less than 10 acres are often considered too small to be practical, and management units larger than 200 acres are too large. Furthermore, I think it is more likely to have slightly right-skewed distribution, which means that the very largest stands are slightly less probable. This last piece leads me to select parameters \(\alpha = 2, \beta = 5\), provided in the `shape1`

and `shape2`

arguments in the `rbeta`

call below, and which give the distribution the desired shape. To see this, we can make 1000 draws and check out the density.

```
min_acres <- 10
max_acres <- 200
beta_draw <- rbeta(n = 1000, shape1 = 2, shape2 = 5)
stand_acres <- beta_draw * (max_acres - min_acres) + min_acres
qplot(stand_acres, geom = "density") + xlab("Stand Acres")
```

This looks pretty good to me - the highest density is around 40 - 60 acres, with a long tail to the right representing a few much larger stands.

# Possible Plots

The next step is to simulate basal area data that might come from a stand. Here we are thinking about measuring plots on hypothetical stands that might exist in the forest we are trying to simulate. What are the kinds of conditions we might encounter in that hypothetical forest, and what range of measurements would those conditions produce? In a given forest, we might encounter a variety of age classes - from regenerating stands to early stem exclusion to old growth areas. Moreover, *within each stand* we might encounter a range of conditions, depending on where the plots might fall and how heterogenous a given stand might be.

Again, the choices of distribution and the parameters that define them will at this point depend directly on the kinds of forestland we are trying to simulate. For example, we might make very different choices regarding the selection of distributional family and the range of possible parameters if we are attempting to simulate data from an upland plantation than from a bunch of bottomland hardwood stands. The level of detail and care we need to put into picking exactly the right distributions to sample from is also in direct relationship to how those choices will impact our ability to ask and answer the question at hand. For some questions, the generalization of simple assumptions is good enough - for example, the data are normally distributed - even if you know that not to be the case. Other times, we have to be meticulous in our choices because they will impact the kinds of questions we can ask and conclusions we can draw from our simulation.

For the present example, I decided to use another non-standard beta with upper and lower bounds set at 350 and 0. Because I want the simulation to reflect the possibility of both within- and between-stand variation, I also chose to randomly draw the shape parameters of the beta distribution from uniform distributions. The uniform draws for the shape parameters reflect a lack of knowledge about the *condition* of any given stand in addition to the random placement of plots within it. I set the upper and lower bounds of the uniform distributions here by trial and error - plotting densities and examining their shape. Basically I was going for the possibility of both right- and left- skewness to reflect regeneration and mature forest, respectively, as well as conditions in between.

I selected the number of samples to pull from the distribution based on the size of the stand, in this case 100 samples per acre. One way to think about each of these samples is a possible observation that could have been measured in the stand in question. The key here is to pull just enough samples to cover the range of the distribution pretty well.

```
max_bapa <- 350
min_bapa <- 0
bapa_beta_draw <- rbeta(
n = stand_acres[1] * 100,
shape1 = runif(n = 1, 1, 15),
shape2 = runif(n = 1, 15, 20)
)
possible_plots <- bapa_beta_draw * (max_bapa - min_bapa) + min_bapa
qplot(possible_plots, geom = "density") + xlab("Basal area per acre (ft)")
```

Once we have made a single stand that looks basically how we want, it makes sense to go ahead and functionalize the simulation and iterate it to get a collection of stands. That way we can look at the variation of all the stands and make sure that our simulation is representing the range of forest conditions that we are looking for.

```
make_stand <- function(i) {
max_acres <- 200
min_acres <- 10
acre_beta_draw <- rbeta(n = 1, shape1 = 2, shape2 = 5)
stand_acres <- acre_beta_draw * (max_acres - min_acres) + min_acres
max_bapa <- 350
min_bapa <- 0
bapa_beta_draw = rbeta(
n = stand_acres * 100,
shape1 = runif(n = 1, 1, 15),
shape2 = runif(n = 1, 15, 20)
)
bapa_samps <- bapa_beta_draw * (max_bapa - min_bapa) + min_bapa
# Each stand is a list of its ID, its acreage, and the possible plots that
list(
id = i,
acres = stand_acres,
bapa = bapa_samps
)
}
n_stands <- 500
set.seed(20190211)
all_stands <- lapply(1:n_stands, make_stand)
```

The next part of the simulation is pretty straightforward. At this point, we have the population of stands across the range of stand conditions that we’d like to measure, and for each stand we have a population of possible plots that could have been measured in that stand. Now we have to do the two-stage sampling part - selecting stands and then sampling the population of possible plots to generate our fake cruise data.

For the first part, let’s say we know (based on the measurement needs and budget of the landowner) that we’d like simulate a cruise of about 20% of the stands, and that we’d like to preferentially cruise the larger stands.

```
all_acres <- all_stands %>% map_dbl("acres")
stand_sample_probs <- all_acres / sum(all_acres)
sample_stands <- sample(
all_stands,
size = length(all_stands) * 0.2,
prob = stand_sample_probs
)
ggplot() +
geom_density(aes(x = all_acres, color = "population")) +
geom_density(aes(x = map_dbl(sample_stands, "acres"), color = "sample")) +
xlab("stand acres") +
theme_bw() +
theme(legend.title = element_blank())
```

Once we have chosen our stands, we want to simulate the plot measurement process. For the purposes of this demonstration, let’s assume we have some systematic random sample, and a budget to measure at an intensity of about 1 plot every 4 acres. One way to do this in R is to define a function that simulates the measurement of a single stand, and then iterate that function over the all the stands that we intend to sample. Here I use the `purrr`

function `map_dfr`

to bind the output of each iteration together into a tibble.

```
measure_plots <- function(stand) {
n_plots <- stand$acres / 4
tibble(
stand_id = stand$id,
acres = stand$acres,
plot_id = 1 : n_plots,
bapa = sample(stand$bapa, size = n_plots),
source = "sample"
)
}
plot_data <- purrr::map_dfr(sample_stands, measure_plots)
```

We can go ahead and extract the population data from each of the stands that we sampled, so that we can compare the shape of our samples to that of the population from which they were drawn.

```
get_stand_pops <- function(stand) {
tibble(
stand_id = stand$id,
acres = stand$acres,
bapa = stand$bapa,
source = "population"
)
}
sample_stand_pops <- purrr::map_dfr(sample_stands, get_stand_pops)
```

At this point, we have successfully generated a simulated dataset! Before we put it to work, though, we should examine the data that we have generated to make sure that it looks like we expect it to and will have the properties that will make it useful for our purposes. The gganimate package makes it easy to examine the range in basal area distributions in our simulated samples, and also to compare them to the population distribution from which they were drawn. If we see something we don’t like, now is the time to go back, examine our assumptions, and update the way the simulation is generated.

```
ggplot() +
geom_density(
data = sample_stand_pops, aes(x = bapa, group = stand_id, fill = source),
alpha = 0.3
) +
geom_density(
data = plot_data, aes(x = bapa, group = stand_id, fill = source),
alpha = 0.3
) +
theme_bw() +
transition_manual(factor(stand_id))
```

Once we are satisfied that the data look the way that we might expect, we can put the simulation to work and ask questions from it. If we have made good choices in designing our simulation, the data we have generated are examples of what might exist in the real world. Unlike real world data, however, we did not have to go through the great expense of actually generating the measurements. Also unlike real world data, for this simulation we have known properties (mean, variance, quantiles, etc.) of both populations and samples at our fingertips, and we can exploit that knowledge to check assumptions, see how well statistics computed using samples estimate population properties, build inferential models, and many other analytical tasks.