---start---- epidemiology 1/14/98

[amusing anecdote deleted.] today's anecdote involves Seneca and belief in legends, myths and things. hey - he just said that Paul Revere didn't shout "the british are coming, the british are coming" but rather, "the redcoats are coming, the redcoats are coming," because at the time, he thought of himself as British. Now, I don't know about you, but I was taught all along that he said "The redcoats are coming," for exactly that reason, I guess, and I don't know what this "myth" of "the british are coming" is all about. he also remarks that the kilt was both invented and banned by the British, and was taken up by Scots after it was banned by the British.... point: there is a reluctance to leave a belief behind in the face of reason.

today we'll discuss causation, and what it is necessary to establish before you can believe that one thing causes another.

association vs causation: if a disease is more common in a group with a certain characteristic than in the group without it, there is an association b/w the characteristic and the disease. if the characteristic is present prior to the disease, it is called a risk factor. this does not imply a causal relationship. exposure to any particular factor may be accompanied by a higher risk of disease even though the factor doesn't cause disease in any way. so, think of a risk factor as a marker which designates one group as being at greater risk than another group.

case-control studies: a design in which animal*time at risk is sampled rather than being measured directly. the advantage of a cohort study is that it is the only way to measure incidence - but cohort studies have many disadvantages - time, cost, # of subjects. case control studies, while they do not measure incidence directly at all, have many advantages. the number of cases you can study is not constrained by the natural frequency of disease, and large numbers of possible risk factors can be explored.
since you start with your cases, you're off to a running start. you don't start with healthy animals, as in a cohort study, and wait to see how many get sick. Also, you can keep thinking of hypotheses once you've started. in a cohort study, once you've divided the animals by exposure, you can't undivide them, but in a case-control study you can look for other things.

epidemiological investigators set out to:
1. demonstrate an association exists
2. measure the strength of the association
3. see if the association is causal.
the first two are accomplished by case control and cohort studies, but neither of those studies can determine if the association is causal.

association occurs when two events occur together more often than predicted by chance alone. remember that both case-control and cohort studies test the null hypothesis that there is no association, in which case RR is 1 and OR is 1. If we find that RR or OR >1, we don't immediately think there is an association. why not? because usually we're dealing with subsamples of animals. we're not sure if there was some sampling error. so we have to look at degrees of certainty - confidence intervals. clearly, if the lower bound of the confidence interval is greater than 1, we're reasonably certain that the true RR or OR is also greater than 1. So, first we estimate RR or OR, see if >1, then establish the confidence interval, look at the lower bound, and if it is >1, then the true value is probably also >1. so this is how you decide an association exists.

slide: confidence intervals that we are not supposed to write down. there are various techniques for calculating confidence intervals.

prospective cohort study:
RR = 1 - no association
RR > 1 - exposure increases risk of disease
RR < 1 - exposure decreases risk of disease
or OR for case-control study - same as above. cohort studies measure RR (relative risk), case-control studies measure OR (odds ratio). RR = 1 --> no association --> null hypothesis.
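a minimal sketch of the "estimate RR, then look at the lower bound" routine described above. the lecture didn't give a formula for the interval, so this assumes one common technique (a normal approximation on log RR, sometimes called the Katz method). the function name and the herd counts are made up for illustration.

```python
import math

def relative_risk_ci(cases_exp, n_exp, cases_unexp, n_unexp, z=1.96):
    """Relative risk from cohort counts, with an approximate 95% CI.

    Assumption: log-scale normal approximation; other techniques exist.
    """
    rr = (cases_exp / n_exp) / (cases_unexp / n_unexp)
    # standard error of log(RR) under the log-scale approximation
    se_log_rr = math.sqrt(1 / cases_exp - 1 / n_exp
                          + 1 / cases_unexp - 1 / n_unexp)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# hypothetical cohort: 30 of 100 exposed animals got sick, 10 of 100 unexposed
rr, lower, upper = relative_risk_ci(30, 100, 10, 100)
association = lower > 1  # the lecture's rule: lower bound above 1
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}, association: {association}")
```

with these made-up counts the lower bound lands above 1, so by the rule above we'd be reasonably certain the true RR is above 1 as well.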
our study shows a 95% confidence interval with an upper bound above 1, and a lower bound also above 1, so we're reasonably certain there is an association. fine. BUT - a different study shows an estimated relative risk greater than 1 - but the lower bound of the confidence interval is less than 1, and the upper bound is very much greater than 1. a classical statistician would say the experiment sucked, the answer is too imprecise, the low boundary of the confidence interval is below 1, you can't conclude anything.

There is now a movement among american epidemiologists holding that there is a difference b/w statistical inference and clinical inference. the confidence interval of 95% should tell you that if you did the study 100 times, about 95 of the intervals you calculated would contain the true value. If you think about it like that, then this is a relative risk consistent with values equal to or less than one *and* equal to or greater than one. you look at these answers and you say well, it might be a risk factor, and it might not be a risk factor. so, you might say that you should avoid this factor, until further studies are done. as a clinician, you can't tell people that it *isn't* a risk factor based on this study. basically, it's simply prudent to avoid such a factor pending further study, especially if the upper bound is way higher than one and the lower bound is just a tiny bit less than one.

statistically significant associations: a statistically significant association does NOT imply a causal connection. furthermore, in declaring a statistically significant difference b/w the incidence in the exposed and unexposed groups we mean only that we're reasonably certain that sampling error didn't produce the difference. so, how do you decide if the difference is part of a causal chain, worth pursuing clinically?
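a small sketch of the two ways of reading an interval discussed above. the function name and the example bounds are hypothetical, and the wording of each verdict paraphrases the lecture rather than any standard rule.

```python
def read_interval(lower, upper):
    """Read a 95% CI for RR or OR the way the lecture describes."""
    if lower > 1:
        return "association: exposure raises risk"
    if upper < 1:
        return "association: exposure lowers risk"
    # interval includes 1: might be a risk factor, might not
    return "can't rule the factor in or out - prudent to avoid it pending further study"

print(read_interval(1.2, 3.5))  # both bounds above 1
print(read_interval(0.9, 6.0))  # lower bound just under 1, upper bound way above
```

the second call is the awkward case: the classical statistician's "no conclusion", but clinically not a licence to say the factor is safe.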
three types of association:

A <---- B
here, risk factor B causes disease A

  X
 / \
A   B
here, X causes both A and B, where A is the disease, and B is a "risk factor"

A ----> B
here, disease A could itself cause B, so there is an association, but risk factor B doesn't cause A

the only useful conclusion is the first one. you can't prove a causal connection. **you can not prove causality** write that down! but, you can *infer* a causal connection, given that certain criteria are met:
a) the factor and disease are associated
b) the factor precedes the event it is supposed to cause (temporality)
c) a reduction in occurrence of the factor is followed by a reduction in the occurrence of disease (reversibility)
obviously, a factor can't cause a disease if it doesn't show up until after the disease has occurred. there are also other criteria some people suggest, some of which are in the handout. but remember, causal inference is tricky.

there is still a general view of what constitutes a cause that is much too simple - look at the Henle-Koch postulates:
1. the organism must be present in every case of the disease
2. the organism must be isolated and grown in pure culture
3. the organism must, when inoculated into a susceptible animal, cause the disease
4. the organism must then be recovered from the animal and reidentified.
these postulates were very useful guidelines in their day. the important step from anecdotal evidence and superstition to germ theory was really helped along by these postulates.

note: rice-water feces of cholera isn't rice-water mixed with feces. it is a characterization of the feces produced by a cholera patient.

anyway, the problem with the postulates is that they led to a much too narrow view of causation. each disease was perceived as having a single cause, and each bacterium as producing a single disease.
therefore, many infectious diseases' etiological agents fail these postulates - he says HIV does, but i'm not sure it does. anyway, they fail to account for the fact that causality is a relative concept. the cause of any effect is the set of component causes that act in concert. a component cause is an event or condition that is essential in producing an occurrence of disease. lots of events have to coincide before a disease occurs. in addition to a virus or bacterium invading the host, you must consider the strength of the immune response, which may be affected by the age of the animal. disease is multicausal. we can say that the strength of the immune response is a cause, but even that can be broken down into age of host, prior exposures, genetic differences - so every cause generally has a hierarchy of subcauses beneath it which are themselves legitimate causes. ** disease is multicausal**

classification of causes:
necessary cause: one without which the disease can't occur. if you don't have it, you will not get the disease. varicella-zoster virus (the herpesvirus that also causes herpes zoster, i.e. shingles) is a necessary cause of chickenpox
sufficient cause: that set of component causes (including one or more necessary causes) which always leads to disease

sufficient causes: examples
1. A, B, C, D, E together cause 20% of cases
2. A, B, F, G, H together cause 50% of cases
3. A, C, F, I, J together cause 30% of cases.
A, therefore, must be the necessary cause. Notice that A will occur with various other causes. If you have one of these combinations you will get the disease. Some diseases are most common in the young, old, or immunocompromised. Say sufficient cause 1 includes leptospira, being an infant, and 3 other causes. Say 3 includes leptospira, being elderly, and 3 other causes. 2 includes leptospira, neither youth nor age, and other causes eg alcoholism, concurrent other infection, whatever.
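the three sufficient causes in the example can be tabulated to see what share of cases each component cause takes part in. this is just a sketch of the arithmetic; the variable and function names are made up, and the percentages are the ones from the example.

```python
# each sufficient cause: (its component causes, % of cases it produces)
sufficient_causes = [
    ({"A", "B", "C", "D", "E"}, 20),
    ({"A", "B", "F", "G", "H"}, 50),
    ({"A", "C", "F", "I", "J"}, 30),
]

def share_of_cases(causes):
    """Percent of all cases in which each component cause is involved."""
    share = {}
    for components, percent in causes:
        for component in components:
            share[component] = share.get(component, 0) + percent
    return share

share = share_of_cases(sufficient_causes)
# a component involved in 100% of cases appears in every sufficient cause
necessary = sorted(c for c, percent in share.items() if percent == 100)

for component in sorted(share):
    print(component, share[component])
print("necessary:", necessary)
```

A comes out at 100 (it sits in every sufficient cause), B at 70, C at 50.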
these proportions, the % of cases caused by a given sufficient cause, are called the "attributable proportions", which is the proportion of all cases that is attributable to a particular cause. it doesn't matter what cause you apply that to. you can apply it to sufficient causes, as done here, or to component causes. what's the attributable proportion for cause A? 100%, because all cases involve cause A. what about cause B? 70%, because that's how many cases involve B. C? 50% of cases involve C, so that's the attributable proportion. etc. by definition, the attributable proportion of a necessary cause is 100%. ---break---- no one likes being criticized "american critics are like american universities; they both have dull and half dead faculties." (some artist said that after being criticized...) "a critic is a man created to praise greater men than himself...but he can never find them." "critics are a dissembling dishonest contemptible race of men asking a working writer how he feels about critics is like asking a lamppost how he feels about dogs." "he always praises the first production of the season, being reluctant to stone the first cast." ha ha. epidemiologists are very self-critical. the discussions of their papers tend to be dour accounts of why the work probably isn't really important. Validity and Bias: validity is the extent to which the study measures what it is supposed to measure. a study is valid if it has little or no systematic error (systematic error == bias). There are three general types of bias: -selection bias -information bias -confounding bias is actually a subtle concept. Selection Bias: sometimes the way in which we select animals in a study leads to a sample that isn't representative. This may arise by chance, or it may arise because the selection method *systematically* selects one kind of animal in preference to all the others. This is called selection bias. 
we hope that the sample we choose is representative of the population we are studying, but sometimes something about the way we choose subjects leads to a sample that isn't representative. That's selection bias. Probably the best definition of selection bias is from the human literature. There was an investigation into the incidence of leukemia in those troops who were present at atomic tests in the 50s and 60s. the investigators had to then find those people who were present at the tests, years later, and recruit them into the study. How did they do this? they tracked followup records, social security numbers, etc. also they advertised in newspapers, and waited for people to call them. method one, tracking of followup records, turned up 82% of the total people they found, and method two, self referral, got 18% of the people they ultimately found. Interestingly, 4 leukemia cases were found by method 1, and 4 by method 2. One would have expected to find fewer cases among the self referrals, since there were fewer people in the self referrals. Why? well, there was an increased tendency for those with both the exposure and the disease to self refer, due to the hope of getting paid or winning a lawsuit. Information Bias: occurs both during and after subject selection - a systematic misclassification of animals. 1. misclassification (diseased/healthy or exposed/not exposed) is a common source of information bias - over or underdiagnosis 2. recall bias is another big example, very common. owners of colicky horses remember exposures more thoroughly than owners of healthy horses. when researchers call up owners to ask about risk factors in sick or healthy horses, the people who owned horses that had colic were able to give a lot more information. also the people who had horses with colic had learned about it a lot, so the people would emphasize the things they thought were causative and leave out other things. 
interesting example of recall bias: it's a longstanding joke that if you ask men and women how many sexual partners they've had in their lives, the averages should be roughly identical, right? but they never are. men always report a higher figure than women. this is really important for HIV studies, but obviously someone is not telling the truth, I think. what on earth is going on here? someone plotted it as a histogram, comparing the averages reported by men and women. up to about 20 partners, the averages were the same. beyond that, there were jumps in the data at 30, 40, 50, 100, 200 - this is called "heaping". people start estimating, because they can't really remember. so this is where the statisticians (and social pressures) come in. when men estimate, they overestimate, and when women estimate, they underestimate - so the figures become more disparate. when this data was presented 4 yrs ago, the researcher was going to say that statisticians call this "heaping" but she said "the statisticians call this 'humping'" by accident, or so the story goes. this is a classic example of recall bias, though. this isn't a trivial research problem.

Controlling bias: selection and information biases cannot be controlled after the data have been gathered. once it happens you can't do anything about it. you must try to foresee possible sources of bias before you start to collect the data, and deliberately adjust your methodology to try to eliminate them. this is the only way to get over these forms of bias. If you didn't eliminate it, and you've done your study and you realize the bias is still there, you must interpret your results very carefully, and you then have to write a lot in your discussion about your interpretation... it's possible, in qualitative terms, to say how the bias will affect the RR or OR - some kinds of bias will result in underestimation of OR - so if your OR is >1 and the low end of the confidence interval is >1 despite that, then the bias isn't important: the true OR is, if anything, higher.
But, if your bias reduces OR, and the lower end of your confidence interval is below 1, you don't know if it is because of bias or not. So sometimes you can interpret data in the face of bias, and sometimes not.

Confounding: confounding is a kind of bias that you *can* do something about both before and after you collect the data. if you realize you have confounding after you have already collected the data, you can get rid of it. confounding is often called a mixing of effects, but that isn't fully explanatory. it occurs when the estimate of the effect of an exposure is distorted because it is mixed with the effect of some other factor. what you're really measuring when you measure an RR or OR affected by confounding is the effect of the thing you are looking at plus the effect of something else, which you may not even realize exists.

example of confounding: pigs in fan-ventilated pig houses have a higher incidence of respiratory diseases than pigs that do not have fan ventilation. question: is fan ventilation a cause of respiratory disease? NO. it isn't. but there is an association between fans and respiratory disease. why? something else is causing the increase in disease, and that thing is also associated with fans. herd size is a confounding variable. large herds are kept at greater density, and it is the density that is the real risk factor. more animals, kept closer together - increased viral transmission. it so happens that large herds are also much more likely to be supplied with fans. so the confounder is herd size. herd size was confounded with fans. you had an RR > 1 but only because of this confounding.

typical confounders are things like age, gender, season, breed, use, occupation. these things all have some impact on disease. if you are comparing two groups and you are interested in some other risk factor, it is important to match the groups on these variables.

the nature of a confounder: a confounder must have two properties before being called a confounding factor.
a) it must itself play some part in causation of disease, OR be a marker for some other causative factor, and
b) it must be associated with the exposure of interest - the risk factor you are investigating.

say you do a case control or cohort study and find an association between an exposure E and a disease D, and say E really does cause D. you may still find a disproportionate increase of D when you increase E, because the association is confounded by a factor C which is associated with E and also causes D. if E causes D, and E is associated with NC, but NC doesn't cause disease, then there is no confounding. eg, let NC be gum chewing and sun be E and D be skin cancer. there is no confounding. gum doesn't cause skin cancer even though it is associated with sun exposure. if you studied gum itself, you'd probably find an association b/w gum and skin cancer, but that would be because you'd really be measuring the effect of the sun, which is associated with gum (in that study, sun would be the confounder).

controlling confounding: a cohort study to test the association b/w alcohol drinking and a specific disease.

exposure   cum incidence   relative risk
yes        0.00350         2.8
no         0.00125

RR > 1. fine. we think we have an association b/w alcohol and the disease. but, tobacco use is linked with alcohol, and tobacco might be a contributing cause of the disease. maybe we aren't measuring just the effect of alcohol but also of tobacco. maybe it is all the effect of tobacco! we have to stratify the data into smokers and nonsmokers.

smokers   drinkers   ci      RR
yes       yes        0.004
yes       no         0.002   2.0
no        yes        0.002
no        no         0.001   2.0

now, if smoking is a confounding factor, we expect the RR within each stratum to decline relative to the crude RR (here, from 2.8 to 2.0). If there is no interaction b/w drinking and smoking, we expect the RR to be the same in both strata (here, 2.0 and 2.0). so you can get rid of a confounding variable by stratifying it out, making sure the groups you compare are identical wrt the confounder, so that they differ only in the exposure you're studying.
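the stratification above, redone as arithmetic. this is only a sketch using the cumulative incidences from the two tables; the names are made up. the last step is an extra check not from the lecture: it assumes the crude incidence is just the smoker/nonsmoker mix of the stratum incidences.

```python
# cumulative incidences from the tables above
crude = {"drinkers": 0.00350, "nondrinkers": 0.00125}
strata = {
    "smokers":    {"drinkers": 0.004, "nondrinkers": 0.002},
    "nonsmokers": {"drinkers": 0.002, "nondrinkers": 0.001},
}

def relative_risk(ci):
    """RR = cumulative incidence in drinkers / cumulative incidence in nondrinkers."""
    return ci["drinkers"] / ci["nondrinkers"]

crude_rr = relative_risk(crude)
stratum_rr = {name: relative_risk(ci) for name, ci in strata.items()}
print("crude RR:", round(crude_rr, 2))
for name, value in stratum_rr.items():
    print(name, "RR:", round(value, 2))

# extra check (assumption: crude incidence = weighted mix of the two strata):
# the tables fit 75% of drinkers and 25% of nondrinkers being smokers
mix_drinkers = 0.75 * 0.004 + 0.25 * 0.002
mix_nondrinkers = 0.25 * 0.002 + 0.75 * 0.001
print(round(mix_drinkers, 5), round(mix_nondrinkers, 5))
```

the crude RR of 2.8 drops to 2.0 in both strata, which is the pattern you'd expect from confounding without interaction. the check at the end shows where the extra 0.8 came from: under that assumption, drinkers were simply much more likely to be smokers.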
can also stratify case control studies:

            alcohol
            yes    no
cases       350    125
controls    100    100

Odds ratio = AD/BC = (350 x 100) / (125 x 100) = 2.8

stratified:

            tobacco yes       tobacco no
            alcohol           alcohol
            yes    no         yes    no
cases       300    50         50     75
controls    75     25         25     75
            OR = 2            OR = 2

(300 x 25) / (50 x 75) = 2 for smokers, (50 x 75) / (75 x 25) = 2 for nonsmokers.

controlling confounding:
before data collection:
restriction - don't let the confounder vary - choose all smokers, or all one breed, or all one sex. if you can't do that...
matching - if the confounder varies, make sure it varies in the same way in exposed and unexposed groups - match your subjects.
after data collection:
stratification - control confounding by stratifying the data in the analysis, as in the tables above. you could also use multivariate analysis but that's beyond the scope of this course.

---end---