--start biostat 2.3.97---
we'll be starting handout 5 next time. note: you should always read the handouts in your Copious Free Time.

the sampling philosophy is very simple. diagram from handout four - bell shaped curve... the population has average mu and standard deviation sigma, and you usually don't KNOW or ever find out those values. in real research, we seek those parameters but never really get them. so you draw a sample, a single sample of size n. from that sample you take the sample average, add and subtract from it a constant Z times the standard error, sigma/root n (see handout), and you get a range of values. the Z that you pick gives you the confidence you require. it's arbitrary. we've been using about 2, or 1.96 for 95% confidence. you multiply that by the known standard deviation of the sampling distribution, and you get the range by adding/subtracting from the average. then you get a depiction of what you COULD see from a large number of samples that you COULD have taken, but of course you only take the ONE single sample.

if z is fairly large, say two, you drive down the number of times you miss the true average mu. if you pick z=2 (1.96 to be exact), and the sampling distribution is approximately normal, you miss about 5% of the time. if z=2.58, you have a 1% miss rate. but of course the bigger you make z the wider the interval gets. the only way to have bigger z and smaller interval is to RAISE SAMPLE SIZE. the standard error is simply a function of sample size: it is sigma/root n, so it shrinks as n grows. this is how this works generally.

next steps... what can we presume/assume when you don't know the standard deviation, and what is the likelihood that sampling distributions will be normal/gaussian? but before that, any questions? no. ok.

you have a population of size N from which you sample. when you decide to pick a sample size n, you immediately define another distribution called the sampling distribution. instead of having the original distribution you have this sampling distribution.
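a quick sketch (not from the handout) of the z-interval recipe and the miss rates described above, in python. the population values mu=100 and sigma=15, the sample sizes, and the function names are all made up for illustration; the simulation assumes a normal population with sigma known.

```python
import math
import random


def z_interval(sample, sigma, z=1.96):
    """Sample average +/- z * sigma/sqrt(n), with sigma assumed KNOWN."""
    n = len(sample)
    mean = sum(sample) / n
    half_width = z * sigma / math.sqrt(n)  # z times the standard error
    return mean - half_width, mean + half_width


def miss_rate(mu, sigma, n, z, trials=5000, seed=0):
    """Fraction of simulated samples whose interval fails to cover mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    misses = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma, z)
        if not (low <= mu <= high):
            misses += 1
    return misses / trials


# hypothetical population: mu=100, sigma=15 (illustration only)
for z in (1.96, 2.58):
    for n in (25, 100):
        width = 2 * z * 15 / math.sqrt(n)
        rate = miss_rate(100, 15, n, z)
        print(f"z={z} n={n} width={width:.2f} miss rate={rate:.3f}")
```

the miss rate should come out near 5% for z=1.96 and near 1% for z=2.58 regardless of n, while the width grows with z and shrinks only when n goes up.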
if both the population and the sampling distribution are normal, then the z function discussed above is in fact z, because it is gaussian. if the population is EXACTLY normal, your sampling distribution will be normal for any sample size. but what if the population is not normally distributed? what if it's some funky-ass weird shaped curve? eg spikes and waves all over? well, as the size of the sample goes up, irrespective of the shape of the original distribution, the sampling distribution tends to become normal.

as sample size approaches half the size of the population, you maximize the number of possible samples - eg, samples of 500,000 out of a million - that's the sample size with the maximum number of distinct samples. (?) but you never take that large of a sample.

generally you find that if you're sampling from a population with a single mode, not multimodal, and if the variance is finite, then irrespective of the amount of skew involved, 30 observations in your sample is usually good enough to give approximate normality. if you remove the qualification of single mode, it was figured out that even with a weird-ass curve, approximate normality is reached with samples of 50 or 60. if samples don't exceed that number (they usually do) other methods can be employed to get a normal distribution. but usually it just works out that the sampling distribution is near normal following these guidelines: 60 in this case, or 30 or sometimes fewer with a single mode.... if there's no skew in the population, even samples of size 10 or so will work... see, for z to be correct, the sampling distribution has to be near normal. the more normal the population is to start with, the more quickly the sampling distribution approaches normal.

now, up to this point, we've assumed knowledge of the variance/standard deviation. but in fact, you usually don't know the average OR the standard deviation, and so you have to estimate the standard deviation from the sample, using n-1 in the denominator to make the variance estimate unbiased. [something about history of probability research...]
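a rough simulation sketch of the claim above that the sampling distribution goes normal as sample size goes up. the exponential population (heavily right-skewed, single mode), the sample sizes, and the helper names are assumptions picked for illustration, not anything from the handout. skewness near 0 is used here as a crude stand-in for "looks normal".

```python
import random
import statistics


def skewness(values):
    """Third standardized moment: 0 for a symmetric shape like the normal."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)


def sample_means(n, reps=4000, seed=1):
    """Averages of `reps` samples of size n from a skewed population."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


# skew of the sampling distribution for growing sample sizes
for n in (1, 10, 30, 60):
    print(f"n={n:3d}  skew of sample averages = {skewness(sample_means(n)):.2f}")
```

the skew should drop steadily toward 0 as n goes from 1 to 60, which is the "tends to become normal" behavior in action.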
[something about beer kegs and temperature and beer delivery] so Guinness used to test 200-300 kegs/ten thousand to test for freshness. then this guy figured out they only needed to look at about 30 kegs instead, to get similar probability results. when the guy figured this out, he wanted to publish it. Guinness allowed him to publish it under a pseudonym ("student") so that competitors wouldn't figure out who he was. this is t = (x bar - mu) / (s / root n). (please see handout.)

see p 13a of handout 4. you will see a version of a t table. there are many versions of these available, and they are a bit different from a normal table, since there are many t distributions. usually a selected number of probabilities are chosen and listed along the OUTSIDE of the table. take the second row: level of signif. for 2 tailed test. find the .05 column, and in it the value 1.96. then read over to the left.... see the little infinity mark? that shows the size of sample you need for t to be exactly z. now, look at about 30 on the left. read over to the same column as the 1.96 and see that at 30 it's 2.042 - not far from 1.96! so for something close to 30, the t distribution is still approx normal.

now, what about something close to the sample size... see the little lowercase df on the first column? in this distribution, df == sample size - one. df == degrees of freedom. a mathematical term for the parameter of the t distribution. all distributions have defining parameters; this is the one for this distribution. think of it as a way to look up the constant you need. for a sample of 20, choose df of 19. this table on p 13a represents many t distributions.

what makes t work? why is it that this works out (see p 12 bottom, p 13)? the ratio of (x bar - mu) to s/root n will follow this table if in fact the two estimates, x bar and s, are independently distributed. the only time that is the case is when the population from which they've been sampled is in fact normal.
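the handout table can't be reproduced here, but the entries quoted above (2.042 at df 30, 1.96 at infinity) can be checked numerically. a minimal sketch in python using only the standard library: it integrates the t density with simpson's rule and searches for the cutoff by bisection. the function names are made up, and a stats package would normally do this lookup for you.

```python
import math


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def central_area(c, df, steps=1000):
    """P(-c < T < c): Simpson's rule on [0, c], doubled by symmetry."""
    h = c / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(c, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3


def t_critical(df, two_tailed_alpha=0.05):
    """Cutoff c with P(|T| > c) = alpha, found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if central_area(mid, df) < 1 - two_tailed_alpha:
            lo = mid  # not enough area inside yet, move outward
        else:
            hi = mid
    return (lo + hi) / 2


# sample of 20 -> df 19, and so on; a huge df stands in for infinity
for df in (9, 19, 30, 60, 100000):
    print(f"df={df:6d}  t for 95% two-tailed = {t_critical(df):.3f}")
```

the df=30 line should match the 2.042 read off the table, and the huge-df line should land on 1.960, which is the infinity row.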
so t is only PRECISELY t when in fact the population distribution is precisely normal. but a nice thing happens. as sample size goes up, the sampling distribution itself becomes approx normal. some guy karl pearson found that even when the pop distrib is NOT normal, the only error in t would be if the sampling distribution created from the pop was not normal. so to the extent that the sampling distribution is not normal, t is incorrect. ie, t is still t as long as the SAMPLING distribution is normal. this was all figured out just in time for WWII. we're just going to have to believe him because there isn't enough time to prove this in this class.

but, if you draw a sample, and you don't know the std dev or the mean, it does work. see p 13... t and z values for 95% confidence. notice t is larger. for samples below 60 t gets noticeably above z. instead of the multiplier for the confidence interval being 1.96, you don't know the std dev, so you invoke t instead of z, and the size of the multiplier is larger, meaning you get a wider interval. why? because by not knowing the std dev, you bring more uncertainty into the model, so you need to widen the interval.

going back to election thing... typical projection made (voter profile analysis type thing). there's a referendum where people are voting yes or no. all votes are in. polls are closed. you get a sample of 250 out of hundreds of thousands of voters. so far of this 250 you have 33% YES. now what happens when all votes are counted? from earlier examples we see that the std dev of an observed proportion is equal to the square root of (proportion times 1-proportion, divided by sample size). so if you want to make a confidence interval for pi yes (actual proportion of yes votes), you say:

p +/- Z * root[p(1-p)/n]
= .33 +/- 1.96 * root[(.33 * .67)/250]
= .33 +/- 1.96 * root(.2211/250)
= .33 +/- .058

result: .272 is less than pi yes is less than .388, with 95% confidence (or .275 to .392 if you carry p as 1/3 rather than .33). now, the upper limit is NOWHERE near 50%, so it's really unlikely this thing will pass.
---end---
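a small check of the referendum arithmetic above, as a sketch. proportion_ci is a made-up helper name, and the z value comes from python's statistics.NormalDist instead of a table, so it uses 1.95996... rather than the rounded 1.96.

```python
import math
from statistics import NormalDist


def proportion_ci(p_hat, n, confidence=0.95):
    """p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n), normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    std_err = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * std_err, p_hat + z * std_err


# 33% yes in a sample of 250 voters
low, high = proportion_ci(0.33, 250)
print(f"{low:.3f} < pi yes < {high:.3f}")  # prints 0.272 < pi yes < 0.388

# the referendum needs more than 50% yes to pass
print("upper limit below 50%:", high < 0.5)
```

calling proportion_ci(1/3, 250) instead gives roughly .275 to .392.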