Articles

# Excel 2013 Statistical Analysis #43: Simple Random Sampling in Excel: Process & Theory

Welcome to Excel 2013
statistical analysis video number 43. And if you want to download this
workbook file or the PDF files for chapter 7, click on
the link below the video. Hey, chapter 7, we got
to talk about sampling, and then one of the more
amazing things, the sampling distribution, which will
be a distribution of all of our x bars and p bar too. And we’re going to learn
about the central limit theorem, which will come from
some things we learn when we create a sampling distribution. Now, I’m going to go to
actually to page 9 in the PDF. Here is page 9. In chapter 6, we looked at
our normal distribution. But we were always talking about
x values and probabilities. Not only that, but we
had the population mean and the population standard
deviation, mu and sigma. And we were talking
about x, right? But here’s chapter 7. We’re going to treat the x
bar as a random variable. That means we will plot
all of the possible x bars from a given set
of population data. And we’re going to discover
some profound things. The normal
distribution down here, our random variable
is going to be x bar. Now, in chapter 7,
we’re still going to know the population mean. But we’ll learn something
quite amazing about that. And we’re still going to know
sigma or population standard deviation. But all of our random
variable, those are going to be actual
samples we take, x bar. And we’ll also talk
about proportions where we have p bar. Then in next chapter,
chapter 8, we’ll talk about what to do
when you don’t know sigma, the population
standard deviation. Home to get back to page 1. All right, so sampling
and sampling distribution. We’re going to start off
by talking about sampling. Hey, population? That’s all of the data points. Sample? Just some of them. We’ll talk about
one or two methods of actually how to go and
pick your sample items. But you get the idea– all the items,
some of the items. Now, what happens is we
have a population mu, our mean of all of
these sales values– 289 bucks approximately, right? Well, when you took a sample
and calculated mean x bar, we got what’s called a point
estimate of our population mean, our x bar, 255. Well, what’s the difference
between these two things? It’s called sampling error. And we can see it’s 34 bucks. Now, the question
is what are we going to do with that sampling error? Is the sampling error
just sampling error? Hey, we just took
a sample, so we’re not going to have
exactly the same values, or does it represent
a true difference? So is the point
estimator good enough? And really, what we’re going
to ask is, is it reasonable? So in this chapter, we’re
going to build a model so we can check whether our
point estimator, either x bar or p bar, and the sampling
error, are those reasonable? Sometimes, we’re going
to take a sample, and it won’t be
reasonable, which means it’s not a
good point estimator. Other times, we’re
going to take a sample and we’ll run our test. And it will be reasonable. And we’ll say that,
yeah, the point estimator is a fair representation for
the population parameter. Again, the model
that we’re going to use here to help us decide
whether our point estimate is reasonable is called the
sampling distribution. But first, before we talk
sampling, page 3 in our PDF, sampling. Hey, why do we take
samples in the first place? Well, a few reasons. Some populations are
just impossible to check. We can’t count all
the fish in the sea. And yet, it’s very important
sometimes for people to do samples to figure out
how many fish there are, how many dolphins,
how many orcas. Another reason, calculating off
of the population costs a lot. An example from General Mills– they might hire a firm
to test a new cereal. The sample might cost 40K. Population– that
means going out and asking every single person. No way, way too much money. Another reason that
we take samples is contacting a whole
population is time-consuming. Political polls can be
conducted in one to two days, whereas, of course, contacting
the whole population would take years. How about destructive
nature of some tests? If you’re making wine
in manufacturing, you cannot test every
single bottle of wine. If you make seeds
for farmers, you can’t go out and test every
single one because then there would be none left. And finally and most
importantly, samples are usually adequate. And a great example is
the consumer price index. Those taken from samples
are quite accurate. Now, why do we need samples? Hey, we select samples to
Community College, do Highline students think
advising should be mandatory? That’s a question
right here on campus they’re trying to
figure out right now. We could, at Highline,
get to every single person because actually we have a list
of all the registered students. But it would take a long time. And it probably would be
impossible to get everybody to answer this question. But sampling, we could
get a reasonable estimate of what people thought. For manufacturing, is the
machine filling accurately? So if you’re filling boxes of
cereals or bags of lettuce, we absolutely need to
periodically sample to make sure the
machine is working. Hey, next question
is so big, how do we know where to go
and get the data? Well, the first thing is
we have to be careful. And the big thing we
have to be careful of is we must make sure that the
sample population is the same as the target population. Now, sample
population, that’s just population from which
the sample is drawn. And target population,
that’s the one we’re trying to actually
get at, population, we want to make the
inference about. Here’s an example to illustrate
the difference between sample and target. The goal is to gather data about
students at Laney Community College. That’s a community college
in Oakland, California. By the way, they have a great
culinary arts restaurant there. So here’s the deal. If you took a sample from Laney
College registration list, the sample population would
equal the target population. Because remember, our goal is
to gather data about students at Laney College. But if we took a
sample from people who ate at the culinary
arts restaurant, that means we just went there,
and it was convenient for us just to ask people
as they were eating, right? The problem with
that scenario is, we would get some students
from Laney College. They would be in the
target population. But we’d also get
some people eating who are not Laney College students. Those people are not in
the target population. So in this example,
the sample population would actually be people
eating at the culinary arts restaurant, whereas
the target population was all the students. So the target
population would not equal the sample population. Some other examples–
if we were trying to calculate Seattle
house price mean, and we accidentally used
some data from Burien also. The target population was
just Seattle house prices. But we accidentally mixed and
got some from Burien also. The most famous example,
example three– and I’ll let you Google this because
this is quite an amazing one– the 1936 presidential
election was between Landon and Roosevelt. And there was
a famous poll that was done. And they only used phone
directories and other lists that they mistakenly
thought represented the whole population. But they only represented
middle and upper class people. So the views they got
were highly biased. Well, this famous
poll disastrously mispredicted the outcome. And of course, Roosevelt won. Here, the sample population
was the middle upper class instead of what
they really wanted, the target population,
which was lower, middle, upper, and everything
else, all the classes together. All right, now, we got to
talk about what a frame is. And a frame as simple. It’s just a list of all the
elements in the population, or a list of elements that
the sample is selected from. Now, you can imagine that
frames are not always possible to create. So example frames
that you can create? Well, if we had a
list of all the names of registered students at
Laney College, that’s something we can get, or obtain. A list of all the companies
listed on the New York Stock Exchange, so if we
were going to do a sample on just some
of the companies, we could choose from this list. These are finite lists. If we actually have
a list, then we can use what’s called
simple random sampling. And we’ll look at an
example of that in Excel. We’ll put a list there and
randomly select some data from that list. Examples of frames that
cannot be constructed– population is too big, like
all the fish in the sea. Actually, there
was a famous study just recently too back in
2008, I think it was started. And it went on for years and
years trying to actually count all of the fish in the sea. But of course, they
were doing a sample. Also, frames that
cannot be constructed– populations from an ongoing
process, for example, machines filling
boxes of cereal, transactions
occurring at a bank, customers entering at a store. In all of these cases,
it’s theoretically infinite because there’s no upper limit. We consider it infinite because
we cannot construct a frame. Even though customers
are entering a store, we can count there’s two
people that just came in. But theoretically,
any number of people could have come in at
that particular moment. As you can imagine,
if you have a frame, then it’s a lot easier
to randomly select items from a frame than if
you don’t have a frame. Now, let’s scroll down and talk
about two sampling methods. The first one we’re
going to talk about is when you have a
finite population where you have a frame. We can use simple
random sampling. Now, we’re going to cover this
over in Excel in just a second. But the key is we’re
going to add a new column and use the RAND function to
randomly generate a number. And then we’re going to
sort those random numbers and pick whatever
items are at the top. Here’s some
that over in Excel. Now, the second
method for sampling is for infinite population,
or an ongoing process. That’s called random sample. Now we’re going to select any
end units in a random way. But two things must be true. Each element must be
from the same population. Sample population has to
equal target population. Now, for an ongoing
process, people walking in or an infinite
population, remember, we don’t have that
frame, that list. If we look up here for
a simple random sample, we’re actually going to have
the frame sitting right there. So that’s all of the items that
are sampled and the target. But down here, if it’s
infinite or an ongoing process, we have to be really careful
that the sample population is equal to the target population. The second thing
that must be true is each element is
selected independently. This is so we don’t have bias. We don’t want to select all
the items from a similar group, similar attitudes. Now, if you imagine you
were polling people walking into a restaurant,
well, if you only ask people of a certain
age group because you felt comfortable
talking to them, that would mean you are not
selecting sample elements independently. There is a bias. So both of these
things have to be true when you do a random sample. Now, when you’re doing
an ongoing process like filling a
machine, to make sure that each element is
from the same population, you choose samples from
the same point in time, so all elements from
the same population. So you make sure that you
pull all the bags of lettuce or all the bottles of ketchup or
whatever and right at that one point of time, and then weigh
them, sample them, et cetera. Now, I want to go over to
Excel and figure out how to do a simple random sample. Here’s the method that
we’re going to use. But I want to investigate a
little bit before we actually do the actual random sample. And we’ll look at the definition
of simple random sample. The definition is a
simple random sample of size n from a finite
population of size n is a sample selected such that
each possible sample size of n has the same probability
of being selected. Now, before we can really
make sense of this, I actually want to try and
list every single sample and see what they’re
talking about. So let’s go over to Excel. Now, before we list all
the possible samples, let’s just look at one sample. Here’s our original data set
that we started off this video with, sales rep and sales. And here is one possible sample. There’s the names. There’s the values. Now, we have a population size
big N of 7, 1, 2, 3, 4, 5, 6, 7. The population
mean or mu is 289. We just took an average. Population standard
deviation is about 74. Now, here’s our sample. Sample mean equals x bar
equals the point estimator of the population mean. So here, we come over
and we calculate our mean using the average function. When I hit Enter, 255. That’s the point estimator
of our population mean. But we can see that
there’s a big difference. And the difference is
called sampling error. We’re going to take x bar and
subtract our population mean. Enter. So our point estimator is \$34
below the population mean. Now, part of what we’ll
do in this chapter is we’ll find out if our point
estimator and the sample error seem reasonable, or not. Is this sample error too big? Or is it OK? It’s just sampling error. And we’ll accept this as a
reasonable point estimator to the population mean. Now, this is just one sample. Let’s go over to the sheet,
all the possible samples. And no way. Here’s our data set. We’re going to try and
list every single sample. Now, we actually did
this earlier in the class where we looked at big N
population 7, sample size 3. We used the Combination
function to calculate how many total samples there would be. I got busy listing
every single one. Jo and Sue, I listed
those five times. And then I listed
all the matches. Oh, and I went
through and did it over and over for every
single possible sample. Now, the one we just did on this
sheet right here, Chin, Kip, Jo? That’s in here somewhere. But first, I want to actually
get every single number. So this 185 goes here. This 250 goes here. So for Chin, this
210 goes right here. Then I want to calculate
every single possible x bar. Now, there’s no way I’m going to
copy and paste, copy and paste. Instead, I use the VLOOKUP. Notice in this cell, I
need to look up Jo’s sales and bring it back over here. Here, I need to look up
Sue’s sales right there, 250, and bring it back down here. All the way down here,
I need to look up Tyrone’s sale on this
table, get that number, and bring it back over here. Sounds like the perfect
job for VLOOKUP. Now, VLOOKUP needs
a lookup value. I’m going to say, hey,
VLOOKUP, go look up Jo– and that’s a relative
cell reference– comma. The table– first column has
the items we’re matching. Subsequent columns
have things that we’re trying to get and
bring back to the cell. Now, I’m going to hit
F4 to lock that comma. Now, I need to tell VLOOKUP
which column, one or two, has the thing that I want to
get and bring back to the cell. Well, of course, the second
column has the sales number. So I just put a 2
comma exact match because we’re
exactly matching Jo in the first column, false,
or zero, close parentheses. Control Enter and watch this. I’m going to copy it over and
double click and send it down. I’m going to go to the
last cell, F2, to check it. And sure enough, it’s
looking up Kip in there. And it got exactly
the right number. Man, I would not like
to do that manually. Luckily, we have VLOOKUP. Now, we can do the x bar for
every single possible sample point. Control Enter and double
click and send it down. Now, if we look through here–
and I don’t remember who it was, Chin, Kip, Jo– So Jo and Chin, and
here it is right here. And there’s our sample. But look, that’s just one. And there’s 34 other
possible samples. Now, part of what
we’re doing here is helping us understand
that definition. But also, this is
going to lead us up to the sampling
distribution of x bar and then the central
limit theorem, which will give us real power. Now, let’s just go
back to our PDF. And notice it says a simple
random sample of size n from a finite
population of size n is a sample selected so that
each possible sample has the same probability. So here it is. We have every single
possible sample. And what’s the
probability of each one? It’s a uniform distribution– 1 divided by 35 possibilities. F4, Control Enter, double
click, and send it down. So if we’re going to do a
real simple random sample, the process we use
better have 0.0285 for every single one of
the possible samples. Again, if we calculate
randomly select a sample, it’s going to be one of these. All right, well, guess what? There is a built in function
in Excel that follows a uniform distribution. And we’re going to
use the RAND function. Now, the RAND function
is argument list. That means you just
put open parentheses, close parentheses,
and then Enter. It randomly generates a
number between 1 and 0. It’s actually greater than or
equal to 0 and less than 1. But it’s got a number
with 15 digits. And it’s random. If you hit the F9 key,
which is the evaluate key, it randomly generates
numbers between 1 and 0 up to 15 digits. This follows a
uniform distribution. And we can use this when
selecting our samples to simulate exactly this. Now, let’s scroll over here. We will come back to this
when we build our sampling distribution of x bar
and use this to build the actual distribution. That just means we take all
of our random variables x bar and list them in a histogram
or some sort of chart to look at the distribution. But what we want
to see now is how to create a simple
random sample. So we’ll click on the
sheet simple random sample. Here’s our data set. We’re going to add a column
with the RAND function. Then we’re going to sort it. And then copy and
paste the ones on top. Our sample size equals 3. We obviously have a
finite population. And here’s our frame, right? So we’re going to start
off with equals RAND. Control Enter and copy it down. Now, one thing about the
RAND function is watch this. If I start typing over here,
watch when I hit Enter. Oh, anything you do at all
triggers RAND to recalculate. So what we’re going to do
is now we’re going to sort. And I’m going to go to the data
and right-click uncollapse. So I want to click
in a single cell. And let’s, before we
apply our rule here, just watch what happens when
I sort A to Z. Remember? It’s supposed to bring
the smallest on top. So I clicked it. Is the smallest one on top? Well, I can already
see 0.0a, 0.07, 0.02– that’s the smallest. The reason why is
the act of sorting– click– is another thing that
triggers the RAND function. So when you sort, it actually
does sort the smallest number to the top and the associated
records just for a second. But when it stops, it
retriggers it again. So when we click A to Z, even
though the smallest number is not necessarily at
the top, the data set was sorted using a
uniform distribution. So here’s our rule. Add the RAND. Click sort A to Z
button five times. It could be any number– 1, 2, 10. Just always keep consistent. And then copy and paste
the top three records. So you ready? Click in a single cell. One, two, three, four, five. And that’s it. The top three, I’m going to
Control C and paste over here. Now, I can calculate x bar for
this particular sample equals average. And there we go. Control Enter. 215, so x bar,
215, and let’s all go see if we can
find that over here. It’s Sue, Jo, Chin. So oh, my heavens. It’s the very first one. Look at that. And there it is,
that particular 215. Now, we go back over to
simple random sample. Now, we could do
this another way. And this way is not necessary. This way is kind of
nice because we just take them and copy and paste. This method will
automate the process. If we do it with a
formula like we’re going to do over here to
extract randomly three records, every time
we do anything, it calculates all
of our formulas and reformulates our sample. But nevertheless, notice we
have a first column with RAND. And one of our rules is that we
need the three smallest values. Well, we can use this SMALL
function to actually retrieve the three smallest. The array– that’s this
column right here now– F4 to lock it because we’re
going to copy it down, comma and k, first
smallest, second smallest, third smallest. Boom, like that. Close parentheses, Control
Enter and copy it down. Any time they change, F9,
it automatically lists them first smallest, second
smallest, third smallest. Now notice we have a match
between some lookup value and a first column
in a lookup table. The second column
has our sales rep, which we want to return here. And our third column
has our sales, which we want to return here. So watch this– equals VLOOKUP– there is my lookup value– comma the table array,
the entire range, F4 to lock it comma. In the column index, I’m simply
going to put a 2 comma 0 close parentheses. Control Enter. Now, I can copy this down. And watch this. I’m going to cheat. I’m going to copy this over. It’s going to be totally
wrong for a second. Double click that. Click on the SMALL. And then change this to a 3. Control Enter and copy it down. And so there we have random
samples being created. By the way, if you knew
mixed cell references here, we could do one formula. But we don’t know mixed cell
references in this class, so we’re not going to– we’re going to do it
in a 2-step process. Hey, let’s calculate our
x bar Tab equals average and get this range right here. Enter. We’re even going to calculate
our population mean over here. And then I’m going
to calculate sampling error equals the particular
x bar minus our mu population mean. Wow, so that particular
sample of the 35 possible was 9.7 away. F9, and every time I hit F9,
so we’ve got a 29, a minus 40, a minus 34– that was our one we had before– 51, minus 4, minus 74, minus 16. So there we go,
simple random sample. Now, in this video, we
talked about creating a simple random sample. We used RAND,
SMALL, and VLOOKUP. We also did just the
RAND and the SORT five times and copy and paste. We also very importantly saw if
we have a particular population and sample, we can calculate
all the possible samples, list them, get the sales,
calculate the x bar. We’ll use that next video in our
sampling distribution of x bar. But here, we used it to
just see that the definition of simple random sample set,
each one of these samples has to have the
same probability. And then we talked about a point
estimator and sample error. And then back in our PDFs,
we talked about sampling. Next video, we’ll actually
create our sampling distribution of x bar
and discover something quite amazing about the mean
of the sampling distribution and about the population mean. All right, we’ll
see you next video.