---start biostat.lec.02.12.97---

finishing off handout five, starting on pg 12. another chance to ask questions about sample size and power.... there is a demo in handout 5 of the same concept: solving for a sample size when planning an experiment, or finding the power of an experiment that has already been done.

p 12 handout five. the assumptions we've been talking about involve t and z approximations. a summary of the effects of departure from these assumptions is on this page. this handout has extra stuff in it. go to bottom of p 12, the bottom paragraph: the normality assumptions.... all the way to p 13 up to the part about homogeneity, which we don't have to know about.

so. effects of departure from the assumptions. if you use a z or t test under the circumstances we've depicted so far, these distributions are APPROXIMATIONS to the exact probabilities, which could be calculated if we had the ability to get exact probs all the time, which we don't. so we use these approximations in place of exact probabilities.

the first assumption is that the scores are in fact gaussian: Xi is distributed as N(mu, sigma^2) (top of p 12). second assumption is that the variances of the two populations are equal (sigma1^2 = sigma2^2). final assumption: independence. the way independence is stated is that scores must have a normal distribution under the null w/mean mu and variance sigma^2, and that the scores are independently distributed (?)

now, if these assumptions are violated and sample size is less than 30, the t distribution may not work well. you could get the wrong answer that way. but the first two assumptions, even if they can't be assumed to be true, become largely irrelevant as long as you have a large sample size. with large sample sizes you don't need to worry about one and two, because even if the population being sampled from is not normal, the sampling distribution of the mean, generated from all the samples you might draw, will be approximately normal. and that will allow t or z to work - to give you a correct answer. see bottom p 12 handout 5.
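a minimal sketch of the central limit theorem point above. the exponential population (a strongly skewed, non-normal choice), the function names, the number of repetitions and the seed are all illustrative assumptions, not from the handout. it draws many samples of size n, takes each sample's mean, and measures how skewed the collection of means is; skewness near 0 is what a normal sampling distribution would give.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: average cubed deviation divided by the sd cubed."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

def sample_mean_skewness(n, reps=5000, seed=0):
    """Skewness of the sampling distribution of the mean for samples of
    size n from an exponential population (an assumed, skewed example)."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]
    return skewness(means)

if __name__ == "__main__":
    for n in (2, 5, 30, 100):
        s = sample_mean_skewness(n)
        print(f"n = {n:3d}: skewness of sample means = {s:.2f}")
```

the skewness of the means should shrink toward 0 as n grows, which is the sense in which the first assumption stops mattering for reasonably large samples. it says nothing about independence.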
when a procedure is not affected by violation of its assumptions it's said to be "robust" (see bottom p 13). so the first assumption, normality, is an easy one to dispose of because of the central limit theorem and gauss. we know that as sample size gets larger, the sampling distribution becomes normal, and so t and z are viable approximations even if the underlying assumptions are not met (eg, the population itself is not normal).

table 2 p 14...this is out of a text book written by henry someone. the R values are sample sizes in some hypothetical two sample example, and theta is the ratio of population variances which might exist. what is in the table is .. say you're constructing a confidence interval for the true difference mu1-mu2 from a two sample example...instead of just doing a t test you can also make a confidence interval, say at 95% confidence....if you have 95% confidence of trapping the true mu1-mu2, then 5% of the time you will MISS mu1-mu2. so this table says: if you WERE to do this, under simulation, with fair size samples, varying the two sample sizes and varying the true population variances, how accurate is the 5% miss rate actually? if you look at the table, [mutter mutter] what he's showing is when the actual alpha rate, the type one error rate, equals the one you presume it to be. it's a statistical simulation.

notice that when you have two samples of equal, fairly large size, the equality or inequality of the variances - ie, the variance ratio - has almost no effect on the type one error rate, which stays at about 5%. so you can't go wrong. so the t test is robust: it's ok to depart from the equal variance assumption as long as sample sizes are the same. then, if sample sizes are NOT the same, and variances ARE the same, type one error is again not affected. what have we learned? if sample sizes are equal, don't have to worry about variances; if variances are equal, don't have to worry about sample sizes.
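a rough sketch of the kind of simulation that table summarizes, not a reproduction of it. the sample sizes (chosen so every case has 28 degrees of freedom), the standard deviations, the repetition count and the seed are all assumptions for illustration; the critical value 2.048 is the usual two-sided 5% table value for t with 28 df. the null is true in every simulated experiment, so the rejection rate is the actual type one error rate.

```python
import math
import random

# two-sided 5% critical value of t with 28 degrees of freedom (from a t table)
T_CRIT_DF28 = 2.048

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def pooled_t(x, y):
    """Ordinary two-sample t statistic with a pooled variance estimate."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * sample_var(x) + (n2 - 1) * sample_var(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

def type_one_rate(n1, n2, sd1, sd2, reps=4000, seed=1):
    """Fraction of null-true experiments in which |t| exceeds the 5% cutoff."""
    if n1 + n2 - 2 != 28:
        raise ValueError("critical value is hard-coded for 28 df")
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, sd1) for _ in range(n1)]
        y = [rng.gauss(0.0, sd2) for _ in range(n2)]
        if abs(pooled_t(x, y)) > T_CRIT_DF28:
            rejections += 1
    return rejections / reps

if __name__ == "__main__":
    scenarios = [
        ("equal n, equal variances", 15, 15, 1.0, 1.0),
        ("equal n, variance ratio 4", 15, 15, 2.0, 1.0),
        ("unequal n, equal variances", 10, 20, 1.0, 1.0),
        ("unequal n, big variance in small group", 10, 20, 2.0, 1.0),
    ]
    for label, n1, n2, sd1, sd2 in scenarios:
        print(f"{label}: {type_one_rate(n1, n2, sd1, sd2):.3f}")
```

the first three cases should land near .05. the last case (unequal sizes AND unequal variances, with the larger variance in the smaller group) is included as the contrast: it is the situation neither rule covers, and the rate should come out noticeably above .05.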
so given sample sizes that are very close together and reasonably large, you're quite safe in using the t test or its z approximation - you're quite unlikely to make a mistake using these statistical procedures if sample size is large. and even though this was documented, even today you'll find statisticians saying, even in published articles, that the data didn't appear to be normal, so the PI couldn't do a t test. really, no reason to say that unless the samples being used were very small and you're suspicious of the population. if you're doing a 2 sample example using >30 per group, it doesn't matter much if they are normal or not. it's very unlikely to be a problem. so when people say that, they're generally wrong. some people don't understand this.

now. questions about violation of the normality assumption and the equality of variance assumption?

notice we didn't discuss the third assumption. good reason. the third assumption has no robustness. you can't defeat lack of independence by raising the sample size. when you have a lack of independence you make the situation worse, not better. there is no way to properly deal with dependent observations in an independent observation model. there are ways to cope but not to solve it. dependence doesn't change or go away w/increase in sample size. normality, equality of variances: violating these doesn't cause much of a problem when you have reasonably sized samples. lack of independence, however, does not go away w/increased sample size.

moving on. bottom p 14, randomized assignment. concluding the handout is another view of what happens with two samples under the other paradigm - randomized assignment. i can consider whether two groups are equal or not in at least two ways:

plan A: already introduced and discussed - population a, pop b, see if average in pop a is same as avg in pop b - draw two samples, one from each, and compare sample averages - justification: joint central limit theorems, normality.
briefly, say again, we set that up like this:

Ho: mu1 = mu2
H1: mu1 not = mu2
T or Z = (xbar1 - xbar2) / s_(xbar1 - xbar2)

plan B: another way...if you wanna compare a drug to another drug, or anything where the PI has control over the experiment (as opposed to comparing herd a to herd b or something where you have no control over the expt, where it's just sampling a natural population), you take a group of subjects and randomly put them in group a or group b (or c, d, e, whatever). stick w/2 groups for now. you randomly assign them a group. that creates a basis for being able to test a very similar hypothesis with identical mathematical underpinnings.

say you have tx a and tx b and subjects with homogeneity of disease state, and you divide the patients into two groups by a random process. you try tx a on one group and tx b on the other (or control or whatever), then compare. if the average response in one group is higher or lower than the avg response in the other, it seems that one tx is better than the other - in general, w/o getting into the substance of how it works. that's the basis of modern controlled clinical trials.

remember handout one. the mere act of randomized assignment could produce this difference. if tx a and tx b are REALLY equivalent, but by dumb luck group a had more patients who were going to get better ANYWAY, then at the end of the study it looks like tx a did better, but really it was just luck. this could happen. eg, if 20 people all get drug a, they don't all have the same outcome. some do ok, some die, some get totally cured. so depending on what group people are assigned to, it might falsely appear that one tx is better when it isn't.

so you divide subjects into two groups. under the null, you should expect equal response from the two groups. but even if the null is true you might see unequal response from the groups, based on chance, luck, etc - like getting 15 heads in a row from a fair coin.
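a small sketch of that "dumb luck" point. the 20 outcome scores below are made up for illustration (they are not from any handout), as are the function name, repetition count and seed. each patient's outcome is fixed in advance, ie the treatment does nothing, and the only thing that changes between runs is who gets randomly assigned to which group.

```python
import random
import statistics

# hypothetical outcome scores for 20 patients; fixed regardless of treatment,
# so the null hypothesis is true by construction
OUTCOMES = [0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 8, 9, 10]

def null_randomization_diffs(outcomes, reps=2000, seed=2):
    """Randomly split the fixed outcomes into two equal groups, many times,
    and return the group-a-minus-group-b mean difference for each split."""
    rng = random.Random(seed)
    half = len(outcomes) // 2
    diffs = []
    for _ in range(reps):
        shuffled = list(outcomes)
        rng.shuffle(shuffled)
        group_a, group_b = shuffled[:half], shuffled[half:]
        diffs.append(statistics.fmean(group_a) - statistics.fmean(group_b))
    return diffs

if __name__ == "__main__":
    diffs = null_randomization_diffs(OUTCOMES)
    print(f"average difference over all splits: {statistics.fmean(diffs):.2f}")
    print(f"largest difference favoring a: {max(diffs):.1f}")
    print(f"largest difference favoring b: {min(diffs):.1f}")
```

the differences average out to about zero, but individual splits can show one group looking clearly better, which is why an observed difference has to be judged against what random assignment alone can produce.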
analgesics: someone was trying to demonstrate that a new one was better than an old one for postop pain in k9s. philosophically there's a problem with this - how do you know if a dog is in pain? the idea is, it's up to the nurse/vet to judge pain. you examine the dog, check hr, rr, att, app, etc. and try to judge the level of pain. there are two ways to rank pain in this model.

p 15 of handout (from article by marks et al from annals of int. med 1973) has to do w/scoring pain individually in each category on a scale of 0-1-2 and then adding up the assigned ordinals to come up with a total distress score. 18 would be a 2 in all categories. that's one way to assign pain here - so two or more animals COULD get the same score.

another way to do this, better for patients who can't answer questions, is not to assign a series of ordinals to each animal, but to rank the animals. eg if you have 8 animals, compare them to each other, and if you have a small number of them, rank them in order from least to most pain.

8 animals having the same sx. same breed, same size, etc. prior to sx, w/o telling the sxon, assign 4 to analgesic a and 4 to analgesic b. after sx they get the meds. then a nurse or vet comes in, w/o any knowledge of who got what drug. dogs don't know who got what, either, so this is a double blind exp't. nurse writes down opinion of who has what pain and gives the info to the statistician, who has the code about who is in what group. then you can see the results. top of p 16.

          new analgesic   old analgesic
ranks     3,5,7,8         1,2,4,6
avg       5.75            3.25

note that 8=most pain, 1=least pain. average is lower for old drug. seems that the older drug is the better analgesic, but that COULD be by chance.

now, in the sampling model there were 80000 cows that had a true population average that you just couldn't figure out without measuring all the cows. here, population averages are not like that.
here, if you took another set of dogs and did the exp't again, over and over and over, millions of times, and after doing all that the averages we got were the same, then the null is true. if not, the null is not true. here, mu1 and mu2 mean the ACTUAL response to treatment after EXHAUSTIVE testing. the nulls are the same but the interpretation of the averages is sl. different. in the original model we dealt w/actual population averages...not so here. questions? no. ok. he's going to draw a diagram while we take a break.

----break----

note that in the case of VERY large >100 samples per group, [he didn't seem to finish this statement] there's a whole bunch of stuff on the board; it's meaningless to me, sorry.

so. dogs were ranked after sx and statistician breaks code, results as above. there is an exact basis for this probability calculation, as you'll see. how many ways COULD you have divided the 8 dogs into two groups of 4? combination of 8 dogs taken four at a time...there are 70 ways to do that. 8 of those ways give you equal averages (zero on the baseline), such that both drugs appear to have the same efficacy. there are 5 ways to make it so one group is two units higher than the other group. see histogram in handout p 17.

so in the specific example we have, the difference observed is 2.5. looking at the histogram, we see that there are 3 ways of getting that difference. there are 4 ways to get a greater difference. that's 7 out of 70 total ways, so there is a 10% chance, on the one tail, of observing a diff of 2.5 or greater by chance and chance alone. any questions? no? ok.

now. we don't bother to make these histograms by hand anymore; sometimes we do it by computer if needed. here there are only 70, but if you have a 50 unit sample, the number of exchanges possible is astronomical. and computers can't do it, due to software limitation, presumably (?)

let's discuss a two sample t test. how do we do it?
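the exact count described above (70 splits, 7 at least as extreme as the observed one) is small enough to check by brute force. a minimal sketch; the names are illustrative assumptions, and the ranks are the ones in the table from p 16.

```python
from collections import Counter
from itertools import combinations

RANKS = range(1, 9)           # pain ranks 1 (least) to 8 (most)
OBSERVED_NEW = (3, 5, 7, 8)   # ranks of the dogs on the new analgesic

def mean_diff(new_group):
    """Mean rank of the 'new' group minus mean rank of the remaining dogs."""
    old_group = [r for r in RANKS if r not in new_group]
    return sum(new_group) / len(new_group) - sum(old_group) / len(old_group)

def randomization_distribution():
    """Count how many of the possible 4-vs-4 splits give each mean difference."""
    return Counter(mean_diff(g) for g in combinations(RANKS, 4))

dist = randomization_distribution()
observed = mean_diff(OBSERVED_NEW)
total = sum(dist.values())
as_extreme = sum(count for diff, count in dist.items() if diff >= observed)

if __name__ == "__main__":
    print(f"total splits: {total}")                          # 70
    print(f"splits with equal averages: {dist[0.0]}")        # 8
    print(f"splits with a difference of 2.0: {dist[2.0]}")   # 5
    print(f"splits with a difference of 2.5: {dist[2.5]}")   # 3
    print(f"splits at least as extreme as observed: {as_extreme}")  # 7
    print(f"one-tail probability: {as_extreme / total:.2f}")        # 0.10
```

every mean difference here is a multiple of 0.5, so the floating-point keys are exact. with 8 dogs the full enumeration is instant; it is the growth of the number of splits with sample size that makes this impractical by hand.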
we have a diff of 2.5 and can calculate the sample variances from the observations. it turns out here we have identical variances of 4.92. the sample size is 4 per group, so we take the diff of means and plug it into the t formula with the standard error. we have to make a correction in this case, by subtracting .25 from the numerator - why? because of the width of the base of the histogram blocks, which change by units of .5, so we subtract half of that. also note that in general for a two sample example the continuity correction is equal to one half of the sum of the two inverted sample sizes: (1/2)(1/n1 + 1/n2).

now calculate the t value w/6 degrees of freedom. why 6? (4-1) + (4-1) = 6. t with 6 degrees of freedom comes out to 1.43: (2.5 - .25) / sqrt(4.92 x (1/4 + 1/4)). looking at the t table, the t value for 10% probability (one tail) is 1.44 - that's very very close.

at this point some people say that seems contrived, this is just one set of numbers, what if you use the other two ways of getting the same difference? will they come out the same way? not only will they, they have to. see p 19. all three ways of getting the 2.5 difference are shown. the sample variances change for those ways, but the upshot is that their sum is always the same.

moving on to next section (note: we will get homework assignment today after all)

now as promised earlier we will return for the balance of the hour to the concept of independence and what happens when it is violated. sometimes the violation of independence, nay the actual purposeful installation of dependence, can be helpful. sometimes you make sure you have dependence to help the experimental situation. we've already seen that in the matched pairs exp't - the pairs were designed to be dependent. if we'd gotten rid of the dependency by taking the difference between them instead of the way we did it, it wouldn't have worked well.

what is the consequence of making the assumption of independence when in fact you don't have it? this is a hot topic in recent years. independence is the one thing you can't get away with easily. increased sample size doesn't relieve the problem of dependence.
when you employ such observations as if they were independent and they are NOT, you will mess up as follows. the variance around the sample mean of a single sample is sigma^2/n --> we take the sq root of this to get the standard error. the t statistic is xbar minus the mean, all divided by the standard error. t is assuming independence. if the observations are NOT independent, but you assume they are when you calculate the standard error of the mean, the actual standard error is going to be much larger than the one you calculate: with non-independent observations, sigma^2/n will calculate out smaller than the REAL value.

say observations are made on a patient five or ten times - say, diastolic BP taken five times in an hour, on each of 80 students in a room. this is 400 total observations. say you calculate the std error based on 400 observations and 399 degrees of freedom. well, see, these observations weren't independent. the 5 on each person are DEPENDENT on that person. you don't have 400 replicate observations, you only have 80. so your estimate of variability is greatly reduced by your mistake. there is MORE variability among the population than it would appear if you pretend to have 400 independent observations [oh my god...I UNDERSTAND SOMETHING!!! :) ]

so, that's how you can screw up. you underestimate variability and overestimate the confidence level, because the denominator will be underestimated. that's the downside. believe it or not there's an upside - sometimes when appropriate, like w/puppy pairs, we build in dependency by design, to work to our advantage. how? read on.

two topical drugs used to reduce skin ulcers on livestock. we're calculating "ulcer days", ie days that the ulcer is there, per ulcer. say you have five cows or pigs or whatever. try them on drug a and drug b.

animal      drug A   drug B   diff
1           3        1        2
2           7        4        3
3           4        2        2
4           5        3        2
5           2        1        1
            ------   ------   ----
AVG         4.2      2.2      2.0
est. var.   3.7      1.7      0.5

crossover design; test on drug A first, wait for effects to wash out and then test on the other drug.
some test A first then B, some test B first then A. this isn't a good idea, FDA doesn't like it, because you have to assume there is no carryover effect from the first drug, and if you wait long enough that there can be no carryover, the animal isn't his own control anymore, cause the disease has progressed or regressed naturally. but anyway, it's a good model.

so, estimated variances for A, B and Diff are 3.7, 1.7, 0.5. we're testing five animals. if you just look at the difference produced for cow one going from a to b, cow two from a to b, etc, you only have one set of numbers. are those numbers dependent on each other? NO, they're from five different cows.

null: Ho: muA = muB, OR muDiff = 0

look at the variances involved. 3.7 under the A regimen, 1.7 under B, but only 0.5 for the difference, which is considerably smaller. because not only did we take the difference, but the difference was obtained within each animal, like taking BP on the same animal 5 times...it reduces the possible variation by testing the same subject again. responses within the same animal should be more dependent on each other and should group more tightly. this is a natural followup on the matched pairs expt from handout two.

can you imagine an exact basis for this test? under exchangeability, you should be able to exchange the numbers in the a and b columns within each animal, but the t test works just as well and is a lot quicker. how many degrees of freedom? we have 5 independent scores (getting rid of dependency by using the diff numbers) so 4 deg of freedom.

t(4) = dbar / sqrt(s_d^2 / n) = 2.0 / sqrt(0.5/5) = 6.32

a t value of 6.32 yields a probability of less than .01. so this was done correctly. we took the five animals and got 5 independent observations by calculating the differences between the dependent observations, and calculated out a nice big t value, finding that the probability of seeing a difference this large if the null were true is very small.

but WHAT IF you forgot about dependence?
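a quick check of the arithmetic for both routes, as a minimal sketch using the five-animal table. the function names are illustrative assumptions. the first function is the paired t on the within-animal differences; the second is the ordinary pooled two-sample t that treats the two columns as if they came from ten independent animals.

```python
import math
import statistics

# ulcer days per animal, from the five-animal table
DRUG_A = [3, 7, 4, 5, 2]
DRUG_B = [1, 4, 2, 3, 1]

def paired_t(a, b):
    """t on the within-animal differences; returns (t, degrees of freedom)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    t = statistics.fmean(d) / math.sqrt(statistics.variance(d) / n)
    return t, n - 1

def pooled_two_sample_t(a, b):
    """Two-sample t that (wrongly, for this design) ignores the pairing."""
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * statistics.variance(a)
           + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

if __name__ == "__main__":
    t_paired, df_paired = paired_t(DRUG_A, DRUG_B)
    t_pooled, df_pooled = pooled_two_sample_t(DRUG_A, DRUG_B)
    print(f"paired:  t = {t_paired:.2f} with {df_paired} df")
    print(f"pooled:  t = {t_pooled:.2f} with {df_pooled} df")
```

for reference, the usual two-sided table values are 4.604 for t with 4 df at the .01 level and 2.306 for t with 8 df at the .05 level, so the paired statistic clears its cutoff easily and the pooled one does not.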
then you'd say you had 8 degrees of freedom (4+4), you'd take the two sample variances, 3.7 and 1.7, to get the standard error of barXa-barXb, and do the calculation as above, and the t value is only 1.92 now, so p > .05, ie, the t test fails to reach significance. so even though you gain 2x the deg of freedom, this fails, because the test is incorrect and inappropriately applied.

---end---