---start biostat 1.27.97---
handout 4: sampling theory

ok, we've concluded the preliminaries, we've done the first experiment, and we're shifting gears now into some new stuff.

essentially, in medicine as well as other endeavors that involve making decisions under uncertainty, there are two categories of study. one is the clinical trial type, where the experimenter has complete control over the experiment - can determine what's being tested, assign people to different groups, etc. this is a whole realm of research - fully controlled clinical trials - and mostly this is what we read about when looking at the journals.

the other, separate area doesn't exclude clinical research but is more epidemiological: looking at diseases, infections, retrospective studies, etc. here you no longer have control. this isn't randomized. you can't ASSIGN someone to have a dz - either they have it or they do not. so you're in an observational mode, measuring attributes that already exist in the population.

and you can't measure EVERYONE OUT THERE, right? i mean, you can't find ALL diabetics on the face of the earth and test them... so just as in a randomized design, where you use a small number and extrapolate, you do the same thing here. you draw a sample. eg, on election day, when some votes are in, you make a prediction about who will win, and the next day you see if you were right. now, in epidemiology, you NEVER KNOW for sure that you're 100% right. how do you draw conclusions w/o ever seeing everyone or getting a hard return of facts? by drawing a small sample from a very large pool - eg, millions of votes are being cast, but you only have access to a thousand or 1100 out of those millions. with these thousand or 1100 votes, provided they're drawn at random, you can state what the results will be within a margin of error and with a known probability of being right. so if you want to draw a sample the night of the election that is 99% accurate, it can be done. how?
well, 99% accurate means that 99% of the time you draw the sample, it will be right - doesn't matter where. if many samples are drawn in different venues, 99% will be right and one percent will err. eg, on election night, some anchor will say they're calling the election - or they will say it is too close to call, due to the 'accuracy phenomenon'.

if you generalize this to research, there are multiple populations of people, etc, with millions in each, and you can't measure them all... but the mechanism is the same, the theory is the same. so what we'll do now is he'll show us how, if you have to estimate the true value of something in a population, be it of votes or diseased people, you can do exceptionally well by taking a small sample, and you never need to examine everyone.

some woman was upset that she had never been asked for her opinion in an opinion poll - how could these polls be valid if no one ever asked her? well, said the expert, when we sample for things, there are usually only two or three possible answers. if it's only yes/no/maybe, it isn't necessary to contact EVERYONE. you draw a sample... and the individuals are represented within the sample. so... how many people are there out there with a given dz? millions? but we don't have to observe them all. just some.

so... periodontal dz in animals example...
(note: handouts 4-9 are like a textbook, not like a lecture outline. so take notes)

anyway... you start out in descriptive mode and introduce a concept like periodontal dz in dogs.
pop1: young
pop2: middle age
pop3: old

the theory, at least epidemiologically, re: is this a problem in dogs - at least from a scientific standpoint you go out in the beginning and see how it is affected by age. it's not surprising that it follows the human trend... so you would expect that as age increases, periodontal dz would get worse, right? so it shouldn't surprise anyone if the graph of results for each group (graphing the distance bet CEJ and gumline) was a bell curve, shifted further to the right as age increases.
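to make the election-night / opinion-poll sampling idea from earlier in these notes concrete, here is a minimal python sketch (not from the lecture). it simulates many independent polls of 1100 voters drawn from a huge electorate and counts how often the poll's estimate lands close to the truth. everything specific in it is an assumption made up for illustration: the true vote share of 52%, the 4-point margin, the number of simulated polls, and the function name `simulate_polls`.

```python
import random

def simulate_polls(true_share=0.52, sample_size=1100, n_polls=2000,
                   margin=0.04, seed=0):
    """Return the fraction of simulated polls whose estimate falls
    within `margin` of the true vote share.

    All default values are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_polls):
        # each sampled voter independently says "yes" with probability true_share
        # (a reasonable model when the electorate is far larger than the sample)
        yes_votes = sum(rng.random() < true_share for _ in range(sample_size))
        estimate = yes_votes / sample_size
        if abs(estimate - true_share) <= margin:
            hits += 1
    return hits / n_polls

coverage = simulate_polls()
print(f"share of polls within 4 points of the truth: {coverage:.1%}")
```

the point of the sketch is only that the fraction of "right" polls is high and predictable even though each poll sees a tiny slice of the population. shrinking `sample_size` or `margin` lowers that fraction.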
so distance would increase with age, on average. in general this is probably not a bad depiction. but let us consider, for the sake of discussion, a different possibility, just to give us some idea, however contrived, of how things can be misleading in some situations.

say you have a bell curve w/ avg distance = zero for young dogs. then for middle aged dogs you have an asymmetric curve, with the peak toward the left but a longer extension to the right - ie a skew to the right with the hump on the left. and then for old dogs, a BIG hump on the left and a huge skew to the right (ie, very few animals with very SEVERE dz). but in all cases the MEDIAN is the same.

in young dogs, mean and median are the same, as is the mode. but in middle aged animals you have a skewed curve, so it's unimodal (single mode) but nonsymmetric... so the average value moves toward the extreme values, and the median is less than the average.

median depicts EXACT CENTRAL TENDENCY. the AVERAGE takes the extreme values into account. mode, recall, is the most frequently occurring value. so in a distribution of disease, the average under a symmetric curve is superimposed on the median, but when there are outliers, the mean is pulled toward the extreme values, ie there is a distortion of central tendency. a measure of central tendency is supposed to show the viewers of the data where the center of the data is. the median shows central tendency regardless of bizarre extremes: always half the values below it and half above it. ie, a single value best describing a large number of values for LOCATION, not spread.

now the AVERAGE or MEAN is easily distorted by a few extreme values - when you lose symmetry, you get a distorted view of the data. eg if median income is 30K/family, the average might be 45K/family. 50% of families are making 30K or less and 50% are making 30K or more, but a few are making a million or more, so the AVERAGE is higher. this is the point about central tendency. but usually, the media give you the average, not the median.
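a quick python sketch of the income point above (not from the lecture). the income figures are invented for illustration and are not real data. the only thing being shown is that swapping one ordinary income for a million-dollar one drags the mean way up while the median doesn't budge.

```python
import statistics

# hypothetical family incomes in dollars (made up for illustration)
incomes = [18_000, 22_000, 25_000, 28_000, 30_000,
           30_000, 32_000, 35_000, 40_000, 45_000]

# same families, except the top earner now makes a million
incomes_with_outlier = incomes[:-1] + [1_000_000]

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(incomes_with_outlier)
median_after = statistics.median(incomes_with_outlier)

print("without outlier: mean =", mean_before, " median =", median_before)
print("with outlier:    mean =", mean_after, " median =", median_after)
```

with one extreme value in a list of ten, the mean ends up larger than what nine of the ten families actually earn, which is the distortion of central tendency the notes describe.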
so in the dog thing, the median stays the same, but because a few dogs get bad dz, the average would lead you to believe that OVERALL as dogs age they get worse dz - whereas in our depiction the median dog stays the same and only a select few get dz. (in fact, this depiction is bogus, it's just an example. in real life, the whole bell curve shifts to the right.) so data can be misleading, and using the average and not the median can be a mistake.

suffice it to say that it is reasonable to compare distributions as long as they are comparable. if you have to use averages to compare skewed distributions, it's ok as long as the populations are skewed in a similar way. however, in extreme cases - eg if population one is skewed with a right hump and a left extreme, and population two is skewed with a left hump and a right extreme - the averages are pulled in opposite directions and can come out very far apart even when the medians are close, and concluding that the populations differ that much would be wrong.

another view of comparative distributions: differential variation, as opposed to central tendency or location. you can draw two distributions on top of each other: same average value, but pop1 is a wide bell curve and pop2 is a narrow bell curve. by inspection, on each curve 5% of observations lie in the small area under each tail, so 90% are in the middle. in pop1 that middle is wide, dispersed, and in pop2 it's very narrow. so it would seem that in pop1 there are more lows and more highs - why? because it is so dispersed that only very far from the mean do you start getting to that last 5%, whereas in pop2 the 5% part starts much closer to the mean.

now, for each curve you could calculate, as in handout 3, the VARIANCE, the square root of which is the standard deviation. (see handout 3 for the equation.) in the distribution which is more dispersed, the variance will be much larger than in the less dispersed distribution. so the standard deviation of the narrow distribution will be smaller. seems logical.
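the wide-vs-narrow comparison above can be sketched in python (not from the lecture). assumptions, all made up for illustration: both populations are normal with mean 10, pop1 has standard deviation 4 and pop2 has standard deviation 1, and `statistics.variance` uses the sample (n-1) formula, which may or may not be the exact equation in handout 3.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# two hypothetical populations: same mean, different spread
pop1 = [rng.gauss(10, 4) for _ in range(10_000)]  # wide bell curve
pop2 = [rng.gauss(10, 1) for _ in range(10_000)]  # narrow bell curve

def summarize(values):
    """Return mean, variance, sd, and the cutoffs that leave 5% in each tail."""
    # n=20 gives 19 cut points at 5%, 10%, ..., 95%
    cuts = statistics.quantiles(values, n=20)
    return {
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),  # sample variance (n-1 divisor)
        "sd": statistics.stdev(values),           # square root of the variance
        "low_5pct_cutoff": cuts[0],
        "high_5pct_cutoff": cuts[-1],
    }

s1 = summarize(pop1)
s2 = summarize(pop2)
for name, s in (("pop1 (wide)", s1), ("pop2 (narrow)", s2)):
    print(name, {k: round(v, 2) for k, v in s.items()})
```

the means come out about equal, but pop1's variance and sd are much larger, and its 5% cutoffs sit much farther from the mean than pop2's.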
let's take another step and investigate another handout 3 concept. these are preliminaries to handout 4, which we should get to next time.

remember from handout 3 our friend the Z value? what was it for? for individual scores varying about a mean, you could calculate a Z value which, if the distribution were normal, would allow us to calculate areas under the curve. this will be revisited in handout 4, but there we won't look at individual values. before that, we need to get some notation and a concept...

when we say a distribution is normal... there is notation used to write that down. by convention, to say that a system of scores is normal, we say that Xi is distributed as a normal curve with mean mu and variance sigma squared. (can't typeset it properly in this text editor, but in plain text it's Xi ~ N(mu, sigma^2). hope you wrote it down :)

a russian mathematician (Chebyshev) pointed out that

P( |Xi - µ| >= k*sigma ) <= 1/k^2

irrespective of the shape of the distribution. ie, if you consider the departure of a score from the mean relative to a number of standard deviations, you can guarantee that the probability of a particular score deviating from the mean by k standard deviations or more is no greater than 1/k^2.

i have no bloody buggering idea what he's talking about.

say you're two standard deviations from the mean. k=2. 1/2^2 = .25, so at most 25% of scores can be two or more standard deviations from the mean, ie at least 75% are within two standard deviations. without knowing anything about the shape of the distribution (ie not assuming the normal), this is guaranteed as an upper bound. you can do no worse. (for comparison, if the distribution IS normal, only about 5% of scores are that far out.)

conceptually, the analogy to z is k: the number of standard deviations a score is from the mean. if the distribution isn't normal, or you don't know what it is, you can use this inequality, which was developed by Chebyshev.
---end---