---start 1.28.97...biostat---
note: septa problem this morning...

today we start handout four: sampling theory.

why sampling theory? several reasons.
-to estimate something you can't totally measure: eg, election results, disease level or reasons for disease, eg an epidemiological study...in these studies you have to make estimates under uncertainty because you CAN NOT use EVERY possible subject. so you use sampling. so you don't think there's anything funny going on in this procedure, he's going to contrive small examples that can be presented in their entirety, so we can see entirely how this works. eg, in medicine, a brain tumor study would have 10-15 subjects. a vaccine trial maybe a few hundred people...but the vaccine will be used on MILLIONS of people. so we need to understand that this sampling stuff really works.
-also: when comparing two or more populations, in a clinical trial or drug or procedure - again, you only draw small samples. you use a small # of observations to compare the situations. but you can make precise statements about how likely your results would be in a much much LARGER population.

so. if you have a population PoP, of size N where N is very large, there are two items of interest:
mu (µ) == the average, the central tendency locator (disease level, age, shoe size, height, whatever.)
sigma^2 == the variance, or sigma == the standard deviation
if you have the average and standard deviation you can describe percentiles, averages, etc. and you can use these values to compare between populations.

from PoP of size N you draw a sample of smaller size n. from the n observations you get the SAMPLE AVERAGE (sum of observations divided by sample size) and you get the standard deviation and variance.
bar x = sample average
s^2 = sample variance
s = sample std deviation
these are all statistical estimates of the true population values. now.
without getting too technical...at the minimum it would be useful if each of these estimated values was, on the average, equal to the thing you are trying to estimate. if they are not, they are considered biased.

so, what does that mean, "on the average" equal? in the handout it explains about "expectation". if you consider drawing a sample of size n from a population of size N, how do you do it? you randomly draw n individuals from the N and use them. now every time you draw a single sample in a real application, you think - there are many such samples i could have drawn...HOW MANY could i have drawn? i mean, how many possible different samples of size n could i have drawn? if i randomly take 120 of 120000, how many possible combinations of 120 could i have drawn?
N C n = N*
so you need to consider this... if you had a computer actually draw all N* possible samples of size n for you, and calculated bar x for each one, and then averaged them all, eg, (bar x + bar x + etc) divided by N* --> it SHOULD give you the actual average mu. if that is the case, the procedure is said to be UNBIASED.

this seems painfully obvious to some and unclear to others. any bowlers out there? usually you bowl a set of 3 games and average them. so they give you an average each time, and ultimately you get a season average. and it is kind of the same thing, in that the expectation of the daily averages comes back to the seasonal average.

Expectation of the sample average is defined as "average of everything you would see if you could produce all the samples." this is in handout.

BUT the expectation of the sample variance (the version where you divide by n) is NOT equal to the actual variance. eg, if you did the study on ALL POSSIBLE samples, and figured out the sample variances, and averaged them, it would NOT be the actual variance. so, some guy figured out that instead, you are getting a bias factor of (n-1)/n, eg: the expectation of the sample variance is going to equal (n-1)/n times the population variance.
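[side note: a minimal python sketch of the "average over all possible samples" idea, not from the handout. the toy population and the names are made up for illustration. it ASSUMES ordered draws WITH replacement, because that is the setting where the (n-1)/n factor comes out exactly; drawing without replacement from a small finite population gives a slightly different factor.]

```python
from itertools import product

def average(xs):
    """Plain arithmetic mean."""
    return sum(xs) / len(xs)

def variance_div_n(xs):
    """Variance with divisor n (the biased version when used on a sample)."""
    m = average(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

population = [2, 4, 6, 8, 10]   # hypothetical toy population
n = 2                           # sample size

mu = average(population)
sigma2 = variance_div_n(population)

# Assumption: every ordered sample of size n drawn WITH replacement.
samples = list(product(population, repeat=n))

mean_of_means = average([average(s) for s in samples])
mean_of_var_n = average([variance_div_n(s) for s in samples])
mean_of_var_n1 = mean_of_var_n * n / (n - 1)   # premultiply by n/(n-1)

print("mu:", mu, " average of all bar x:", mean_of_means)
print("sigma^2:", sigma2, " average of divide-by-n variances:", mean_of_var_n)
print("after multiplying by n/(n-1):", mean_of_var_n1)
```

the average of all the bar x values lands on mu, the average of the divide-by-n variances lands on (n-1)/n times sigma^2, and the n/(n-1) rescaling brings it back to sigma^2.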
so if you premultiply your variances by n/(n-1) it cancels out, and you get the correct variance on the average (check this in handout). (in fact, all math should be studied from handout. it confuses the hell out of me.)

so, sample variance sub n-1 equals the sum from 1 to n of the squared deviations about the sample mean, divided by n-1 --> so this is an unbiased estimate of the population variance, and you take the square root to get the sample standard deviation.

[i am sooooooo tired.] [chautauqua anecdote was told. *yawn*]

ok. he's drawing a bar graph on the board.
N = 8, µ = 4.5, sigma = 2.2913
[bar graph: frequency f on the vertical axis, Xi = 1 through 8 on the horizontal axis, every bar the same height, mean marked at 4.5]

ok, so pretend we don't have all 8. we're going to use sample size n = 4. we need to figure out bar x and sigma sub (n-1).
8C4 = 8!/(4!4!) = 70 ...so there are 70 possibilities.
so if you look at the handout, you see the bar x results
2.5, 2.75, 3.0, 3.25, 3.5, 3.75, 4.0, 4.25, 4.5, 4.75, 5, 5.25, 5.5, 5.75, 6, 6.25, 6.5
there are 70 ways to get a sample of size 4
there are 8 ways to get an average of 4.5
there are 3 ways to get an average of 3.25
there is only one way to get 2.5 or 2.75 or 6.25 or 6.5
now, note that the population is a flat line in the first chart, above. the sampling distribution is looking bell shaped, though (see handout).

--break---

so. are we all clear on this hypothetical distribution of all possible samples we could have drawn? that's what i talked about just before break. the bell shaped thing.

now, the actual mean, 4.5, is the same as the mean of the sampling distribution.
the standard deviation of the population is 2.2913
the standard deviation of the SAMPLING DISTRIBUTION is not *called* the standard deviation...it's called the standard error, or the standard error of the mean. this is purely a semantic differentiation. mathematically, it is still a standard deviation.
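[side note: a minimal python sketch that rebuilds the sampling distribution for the board example, population 1 through 8 and samples of size 4 drawn without replacement. the variable names are mine, not the handout's.]

```python
from collections import Counter
from itertools import combinations

population = range(1, 9)   # the eight values on the board: 1..8
n = 4                      # sample size

# Every possible sample of size 4 (order ignored, no repeats).
samples = list(combinations(population, n))
means = [sum(s) / n for s in samples]

# How many samples produce each possible average.
freq = Counter(means)
for m in sorted(freq):
    print(f"{m:5.2f}  {'#' * freq[m]}  ({freq[m]})")

mean_of_means = sum(means) / len(means)
print("number of samples:", len(samples))
print("average of all sample averages:", mean_of_means)
```

the printout is a sideways histogram: one sample each at 2.5, 2.75, 6.25 and 6.5, three at 3.25, eight at 4.5, and the whole thing humps up in the middle even though the population itself is flat.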
so, you substitute bar x for x when making your calculation in this situation, and in this FINITE population you can calculate the standard error by root mean square, or by algebraically noting that it is equivalent to taking sigma divided by the square root of the sample size and multiplying by the sq.root of (N-n)/(N-1):
sigma sub bar x = (sigma / sqrt(n)) * sqrt((N-n)/(N-1)) = (2.2913 / 2) * sqrt(4/7)
this is another way of calculating standard error. and in this case it is 0.87.
so now we can define this distribution...n = 4, µ = 4.5, sigma sub bar x = 0.87
SEE HANDOUT PLEASE.

erm. wait. he just said the population isn't described pictorially in the notes. *sigh* typical. well, the sampling distribution is close to a normal distribution, and looks bell like. ok?

now, how do we tell if it's a truly normal distribution?
Z = (Xi - µ) / sigma
if you shade in the right tail of the distribution, the bars of height 3, 2, 1 and 1 (averages 5.75, 6, 6.25 and 6.5), you have 7/70, eg 10% of the total area. so when drawing a sample of size 4, the chance of observing an average of 5.75 or higher is 10%. same with 3.25 or lower. in real life, we wouldn't know these exact numbers.

remember this is a sampling distribution, not a real population. so we change Xi into bar x in the Z equation, and we make sigma == sigma sub bar x. so we can map the normal deviate onto this curve (or did he say normal deviants? nice name for a statistics team or something..."the normal deviants")

so Z = (5.75-4.5)/.87 == 1.44
this is not 10%. um, these blocks have width. we need to adjust for continuity using 1/2 the width of a histogram block, and subtract it. if we do that, then:
Z = ((5.75-4.5) - .125)/.87 == 1.293
and that works out to be about 0.098 ???? what?? oh, oh, right. you have to look up the tail area in the table of z values. so this is correct, it's about 10%. (the table gives about 0.075 for 1.44, which isn't that close to 10%, whereas obviously with the correction it works out well.)

what do we learn from this? by applying the normal mapping and the continuity correction, we see that the sampling distribution is in fact very close to a bell shaped normal curve.
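[side note: a minimal python sketch checking the two ways of getting the standard error and the continuity-corrected z for the board example. names are mine. it uses python's statistics.NormalDist in place of the printed z table, and it keeps the unrounded standard error, so the z values differ a little from the ones worked with 0.87.]

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist, pstdev

population = list(range(1, 9))      # board example: 1..8
N, n = len(population), 4

means = [sum(s) / n for s in combinations(population, n)]

sigma = pstdev(population)          # population standard deviation
se_rms = pstdev(means)              # root mean square over all 70 averages
se_formula = sigma / sqrt(n) * sqrt((N - n) / (N - 1))   # finite population formula

# Exact chance of an average of 5.75 or higher, by counting.
exact_tail = sum(m >= 5.75 for m in means) / len(means)

std_normal = NormalDist()           # mean 0, standard deviation 1
step = 1 / n                        # spacing between possible averages (0.25)

z_plain = (5.75 - 4.5) / se_formula
z_corrected = (5.75 - 4.5 - step / 2) / se_formula
tail_plain = 1 - std_normal.cdf(z_plain)
tail_corrected = 1 - std_normal.cdf(z_corrected)

print("sigma:", round(sigma, 4))
print("standard error (rms, formula):", round(se_rms, 4), round(se_formula, 4))
print("exact tail:", exact_tail)
print("normal tail without correction:", round(tail_plain, 4))
print("normal tail with correction:", round(tail_corrected, 4))
```

the two standard errors agree, and the corrected normal tail comes out much nearer the exact 7/70 than the uncorrected one does.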
so even for small samples, this very quickly becomes normal.

the continuity correction is 1/2 of 1/n. this value is subtracted from the numerator. now the limit of this as sample size goes to infinity is zero...because n is in the denominator of the correction, see. so it is negligible as n grows very large.

another thing that happens, when you consider the equation of standard error...
sigma bar x ^2 = (sigma^2/n) * (N-n)/(N-1)
ok, now he's writing limits and infinity signs and i just can't deal with this math. as N goes to infinity the (N-n)/(N-1) factor goes to one, is his point. so, that leaves you with sigma^2/n as the value of sigma bar x ^2, or, put another way - sigma over sq.root of n is sigma bar x.

[ha. he asks for questions. i don't KNOW enough to ask a question. what could i say? "excuse me, sir, but what the hell are you talking about??"]

confidence level. two numbers, a low and a high number. these define the confidence interval. these numbers have a stated probability of trapping the true value you seek. you hope the value you seek is somewhere in that interval. as your sample size increases, the limits of the interval get closer and closer together. you don't have to know the precise true value, but you get precise limits, very close together, and you know the limits trap the true value 99% of the time, and that is useful.

before we talk about that, we have to learn to walk first, so to speak. so
µ +/- Z sub alpha/2 * sigma sub bar x
Z sub .057 = 1.58
4.5 +/- 1.58 * 0.87
P[4.5 - (1.58 * .87) <= bar x <= 4.5 + (1.58 * .87)] = 1 - 2(.057) = .886 = 62/70
limits: 3.125 and 5.875
so bar x is between 3.125 and 5.875, i guess is what he's saying. if you went back time and time again and took 4 out of these 8 items, your sample average would fall between these values 88.6% of the time. (i first wrote down 66%. where did THAT number come from? by the line above it has to be 88.6%, eg 62/70. I am so confused (sob).)
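[side note: a minimal python sketch of the µ +/- Z * standard error range for the board example, counting how many of the 70 possible sample averages fall inside it. names are mine. statistics.NormalDist stands in for the z table, and the standard error is left unrounded, so the limits differ slightly from 3.125 and 5.875.]

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist, pstdev

population = list(range(1, 9))      # board example: 1..8
N, n = len(population), 4
means = [sum(s) / n for s in combinations(population, n)]

mu = sum(population) / N
se = pstdev(population) / sqrt(n) * sqrt((N - n) / (N - 1))

alpha_half = 0.057                           # area cut off in each tail
z = NormalDist().inv_cdf(1 - alpha_half)     # z with 0.057 above it

low, high = mu - z * se, mu + z * se
inside = sum(low <= m <= high for m in means)

print("z:", round(z, 3))
print("limits:", round(low, 3), round(high, 3))
print("averages inside:", inside, "of", len(means))
print("proportion inside:", round(inside / len(means), 3))
print("normal theory says:", round(1 - 2 * alpha_half, 3))
```

since the averages only come in steps of 0.25, everything from 3.25 through 5.75 counts as inside, which is where the 62 of 70 comes from.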
um, now he says 88.6% of the time you will observe a value between 3.25 and 5.75, and 11.4% of the time you will have means outside that range. But, where are 5.75 and 3.25 coming from? what happened to 3.125 and 5.875? AAAARRRGGHHH!!! (i think: the sample averages only come in steps of 0.25 here, so 3.25 and 5.75 are just the outermost averages you can actually observe inside the limits 3.125 and 5.875.)

so you have these ranges produced arithmetically based on the normal distribution. you add/subtract Z * standard error to/from the mean and get this range. and you have an answer that works 100*(1-alpha)% of the time. (?)

if you now remove the µ from the equation and replace it with bar x, you get limits around a µ value...eg, now you are predicting µ under uncertainty. if you pretend you do not know the mean, and you use this equation, taking a sample average (bar x) and adding or subtracting Z * standard error, you end up trapping the true mu within the confidence range. see handout.

if you pick z in a real expt at approx 2, eg if you produce confidence intervals and put in 1.96 for z - about 95% of the intervals produced from the samples you could draw would trap the true population average, and 5% would miss. now, if you raise z to 2.58, you drop the miss % to 1%. but, as you make z bigger, the interval is wider, so you don't miss as much, but it is less useful scientifically.

in the remaining 7 minutes we'll try to see how this works. can't just take ANY z, we have a finite population in our example.
let alpha/2 = .029 (eg, 2/70)
Z sub .029 = 1.90
barX1 = 2.5
barX2 = 2.75
barX3 = 6.25
barX4 = 6.5
pretend we don't know the true mean. these four averages should produce confidence intervals which miss µ, and all others should trap it. so, 2.5 +/- 1.9*.87...and so forth for each bar x value, you get as follows:
bar x    lower <= µ <= upper
2.5      0.847      4.153
2.75     1.097      4.403
6.25     4.597      7.903
6.5      4.847      8.153
these are the four misses (none of them contains 4.5). the balance of the time we would win. miss 4/70 = 0.057, win 0.943 --> eg any other average you pick will win. so, if you avoid the tails of the curve it should work, is what i guess the point is.
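[side note: a minimal python sketch of the hit/miss count for the board example. it builds bar x +/- 1.90 * standard error around every one of the 70 possible sample averages and checks which intervals fail to trap µ. names are mine. like the lecture example it ASSUMES the standard error is known from the population; in a real experiment it would have to be estimated from the sample.]

```python
from itertools import combinations
from math import sqrt
from statistics import pstdev

population = list(range(1, 9))      # board example: 1..8
N, n = len(population), 4
means = [sum(s) / n for s in combinations(population, n)]

mu = sum(population) / N            # the "unknown" true mean, 4.5
se = pstdev(population) / sqrt(n) * sqrt((N - n) / (N - 1))

z = 1.90                            # z value used in the notes
half_width = z * se

# A miss is an interval bar x +/- half_width that does not contain mu.
misses = sorted(
    m for m in means
    if not (m - half_width <= mu <= m + half_width)
)
coverage = 1 - len(misses) / len(means)

print("half width:", round(half_width, 3))
print("sample averages whose intervals miss mu:", misses)
print("proportion of intervals that trap mu:", round(coverage, 3))
```

only the samples out in the two tails produce intervals that miss; an average of 3.0 or 6.0 still traps 4.5.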
if you pick a 3.0 for bar x and add or subtract 1.9*.87 you get limits of 1.347 and 4.653, which does trap 4.5

this is apparently a brilliant concept. i fail to be enlightened, but hey - what the hell do i know?

if you take some z value, say 2, multiply it by the standard error for that sample, and add/subtract that from the single sample avg, you get upper and lower limits w/ about a 95% chance of trapping mu. now, do you KNOW if you are right or wrong? well, no. but if you are right 95% or 98% or 99% of the time....

if you make the sample larger, the standard error gets smaller, so you're making the interval very very narrow and getting very useful information, and you don't need to know precisely what mu is.
---end---