Electronic Letters to:
Electronic letters published:
Douglas Mossman, Professor of Psychiatry Wright State University Boonshoft School of Medicine, Thomas M. Sellke
douglas.mossman{at}wright.edu Douglas Mossman, et al.
In discussing actuarial risk assessment instruments (ARAIs), Hart and colleagues (2007) acknowledge that "prediction" may refer to probabilistic statements (e.g., a "prediction" that an individual "falls in a category for which the estimated risk of violence was 52%" [p. s60]). For unclear reasons, however, the authors seem to value only predictions with right-or-wrong outcomes. They therefore regard statements about future behavior of large groups (where one can be almost certain that the fraction of persons who act a certain way will fall within a narrow range of proportions) as potentially "credible," but predictions for individuals as meaningless.

If the purpose of risk assessment is to make choices, however, then well-grounded probabilistic predictions about single events help us. Suppose we conclude it is legally and ethically acceptable to impose preventive confinement upon individuals in ARAI categories with estimated recidivism rates above a specified threshold T. This policy entails making "false negative" and "false positive" decision errors. We recognize, however, that unless we are omniscient, perfection is not an option, and ARAIs simply help us make better decisions than we otherwise could.

How do "margins of error" in estimated recidivism rates affect our decision process? Hart and colleagues believe their "group risk" and "individual risk" 95% confidence intervals (CIs) speak to this problem. Their group intervals are standard CIs for estimated population proportions based on random samples. If T lies outside the group risk CI for a category, then we can be reasonably certain that a decision we make concerning someone in that category is the same decision we would make if we knew the true recidivism rate for that category. If T falls within a category’s group risk CI, then our estimate quite possibly might lead to the "wrong" decision.
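The group-interval check described above can be sketched in a few lines. This is only an illustration: it assumes the group interval is the Wilson (1927) score interval, and the sample size `n = 100`, the threshold `T = 0.50`, and all function names are hypothetical choices, not values taken from the letter or from any ARAI. Only the 52% estimate comes from the quoted example.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion p_hat estimated from n cases."""
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)
    ) / denom
    return centre - half_width, centre + half_width


def decide(p_hat, threshold):
    """Decision rule from the text: act on which side of T the estimate falls."""
    return "confine" if p_hat > threshold else "do not confine"


def threshold_inside_interval(p_hat, n, threshold, z=1.96):
    """True when T lies within the group CI, so the decision is less certain."""
    low, high = wilson_interval(p_hat, n, z)
    return low <= threshold <= high


# Hypothetical inputs: 52% is the quoted estimate; n and T are assumptions.
p_hat, n, T = 0.52, 100, 0.50
low, high = wilson_interval(p_hat, n)
print(f"estimate={p_hat:.2f}, 95% CI=({low:.3f}, {high:.3f})")
print("decision:", decide(p_hat, T))
print("T inside CI:", threshold_inside_interval(p_hat, n, T))
```

With these assumed numbers T falls inside the interval, which is the case where the estimate "quite possibly might lead to the wrong decision"; a threshold well away from the estimate (say 0.30) falls outside it.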
Statistical decision theory (Berger, 1985) shows, however, that it is still a sensible strategy to choose whether to confine a member of a category based on which side of T our estimated risk falls.

Hart and colleagues talk about "individual risk" as though it is something different from category (or "group") risk. Yet if all one knows about an individual is his membership in a risk group, what can "individual risk" mean? The authors do not say. If "individual risk" refers to believed-to-exist-but-unspecified differences between individuals within a category, however, such differences should not affect choices by a rational decision-maker.

The 95% CIs for "individual risk" pile nonsense on top of meaninglessness. Hart and colleagues describe the replacement of "n" by "1" in the Wilson (1927) formulae as "ad hoc," but this substitution makes no sense when the basis for the estimated proportion is an n-member sample. With "1" in place of "n," the formulae just don’t mean anything.

Using ARAIs raises serious moral problems as well as the valid scientific questions that Hart and colleagues mention. But in faulting ARAIs’ capacity to address an unspecified quantity called "individual risk," and in dressing up this notion with misapplied formulae for CIs, Hart and colleagues ultimately create a muddle.

References

Berger, J.O. (1985) Statistical Decision Theory and Bayesian Analysis, 2nd Edition. Springer-Verlag, New York.

Hart, S.D., Michie, C., & Cooke, D.J. (2007) Precision of actuarial risk assessment instruments: Evaluating the ‘margins of error’ of group v. individual predictions of violence. British Journal of Psychiatry, 190, s60-s65.

Wilson, E.B. (1927) Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22, 209-212.

Authors

Douglas Mossman, M.D., Department of Psychiatry, Wright State University Boonshoft School of Medicine, 627 S. Edwin C. Moses Blvd., Dayton, OH, 45408-1461 USA

Thomas Sellke, Ph.D., Department of Statistics, Purdue University, 150 N. University Street, West Lafayette, IN 47907-2067 USA

Declaration of Interest

Neither of the authors has received fees or grants from, is employed by, has consulted for, has shared ownership in, or has any close relationship with an organization whose interests, financial or otherwise, would be affected by the publication of this letter.
Grant T. Harris, Director of Research, Mental Health Centre, Penetanguishene, Marnie E. Rice and Vernon L. Quinsey
gharris{at}mhcp.on.ca Grant T. Harris, et al.
Hart, Michie, and Cooke reminded readers of a supplement to this journal that, “Predicting the future is very difficult” (p. s63). All physicians are acutely aware of the difficulty of prognosis. But does this mean it should not be attempted? Competent practice, especially for serious conditions and therapies carrying risks, is impossible without some evaluation, one patient at a time, as to the likelihood of various outcomes as a function of various contemplated interventions (including no intervention), or as a function of various diagnostic tests. The advice from Hart and colleagues seems to call for clinicians to eschew empirical data about outcomes among groups of similar patients, but they failed to advise readers about what to do instead.

Statistical and Technical Matters

Hart and colleagues made a statistical argument that the results of widely replicated actuarial systems for forensic risk assessment (the Violence Risk Appraisal Guide, VRAG, and the Static-99) must be “virtually meaningless” (p. s60). Unfortunately, they were led into statistical error by conflating test reliability and validity -- precision of measurement must be treated separately from a test’s association with an outcome. The first error resulting from this conflation was using confidence intervals to assess the “precision” or “margin of error” for an individual test result; in fact, confidence intervals were not designed for this purpose. The appropriate statistic is the standard error of measurement – the margin of error associated with a single person’s true score (an aspect of reliability). The VRAG’s standard error of measurement has been reported both for the development sample and independent replications (Quinsey, Harris, Rice, & Cormier, 2006), consistently indicating that any single score has only a .05 probability of yielding misclassification by more than one VRAG category.
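For readers unfamiliar with the standard error of measurement, the textbook classical-test-theory formula is SEM = SD × sqrt(1 − reliability). The sketch below applies it; the score standard deviation, the reliability coefficient, and the observed score are hypothetical illustrations and are not the VRAG's published values.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed_score, score_sd, reliability, z=1.96):
    """Approximate 95% band around an observed score, using the SEM."""
    sem = standard_error_of_measurement(score_sd, reliability)
    return observed_score - z * sem, observed_score + z * sem


# Hypothetical values chosen only to make the arithmetic easy to follow.
sd, rel, observed = 10.0, 0.91, 12.0
sem = standard_error_of_measurement(sd, rel)
low, high = score_band(observed, sd, rel)
print(f"SEM={sem:.2f}, approximate 95% band=({low:.2f}, {high:.2f})")
```

The point of the sketch is only that the SEM describes uncertainty in a person's score, which is a different quantity from a confidence interval around a category's recidivism rate.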
The analysis by Hart and colleagues was correct in one sense – the amount of misclassification to be expected does vary as a function of the score – those at the extremes exhibit greater risk of misclassification. Again, however, confidence intervals are not the way to compute this error; conditional standard error of measurement is the statistic for this purpose.

Hart and colleagues’ analysis of confidence intervals did legitimately “prove” statistically that one usually cannot learn much from a single case -- one observation usually conveys only a little scientific information. But is it true, as they imply, that a single observation conveys absolutely no information? Readers will recognize that most research findings are simply the aggregation of many single observations. The fact that some research findings yield consistent replication inevitably means that single observations do convey valid scientific information. It’s just that we must often aggregate the single observations in order to evaluate and learn from them.

Hart and colleagues’ second mistake related to aggregated findings about the accuracy of actuarial tools. They slipped from “precision” to “accuracy” as though these are formally synonymous. They are not. As most medical professionals know, test accuracy (i.e., validity) is assessed in terms of sensitivity, specificity, and the tradeoff between these two. We are aware of more than 40 independent tests of the accuracy of the VRAG (and its allied tool the Sex Offender Risk Appraisal Guide, SORAG) in predicting violent recidivism in a total of approximately eight thousand released correctional inmates, sex offenders, forensic patients, civil psychiatric patients, and other clinical samples. These tests have been conducted in at least seven countries and have employed mean follow-up periods ranging from a few months to ten years.
By conventional standards, average predictive effects (in terms of the sensitivity-specificity tradeoff) are large and are distributed as expected by psychometric principles and the laws of probability. Contrary to the assertions of Hart and colleagues, VRAG/SORAG scores have been shown to predict the speed and severity of violent recidivism. If recalculated using all available cases, confidence intervals for category outcomes would be considerably smaller than those calculated for the development sample alone. Similarly, we are aware of approximately 40 replications, involving more than 13,000 cases, of the Static-99. The statistical argument by Hart and colleagues does not and cannot refute these empirical results supporting the accuracy of actuarial risk assessments. It is instructive to consider the argument by Hart and colleagues in a broader medical context. Predicting violent recidivism with actuarial instruments is, in principle, no different than using diagnostic tests to predict development or outcome for such disorders as cancer. The accepted measure of predictive and diagnostic accuracy is the area under the Relative Operating Characteristic (ROC, Swets, Dawes, & Monahan, 2000) which indexes the tradeoff between sensitivity and specificity as a function of test score. Under conditions of good measurement reliability, equal follow-up duration, and few missing items, the VRAG produces ROC values that compare favorably with widely used diagnostic tests (Quinsey et al., 2006). This is true even though the accuracy of actuarial instruments is artifactually lowered by error in measuring the outcome (violent reoffending recorded in official records) whereas the accuracy of diagnostic tests for cancer prediction is generally less affected by such measurement error (for example, using death or autopsy results as the predicted outcome). 
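As a rough illustration of what the area under the ROC curve measures, it can be computed as the probability that a randomly chosen recidivist scores higher than a randomly chosen non-recidivist, counting ties as one half. The scores below are invented for the example and do not come from any VRAG or Static-99 sample; the function name is likewise hypothetical.

```python
def roc_auc(scores_positive, scores_negative):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the share of (positive, negative) pairs in which the positive
    case has the higher score, with tied pairs counted as one half.
    """
    if not scores_positive or not scores_negative:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for pos in scores_positive:
        for neg in scores_negative:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Invented test scores: higher means higher assessed risk.
recidivists = [5, 7, 8, 9]
non_recidivists = [1, 3, 5, 6]
print(f"AUC = {roc_auc(recidivists, non_recidivists):.3f}")
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two groups, regardless of the base rate of the outcome.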
Because ROC analyses are the standard for accuracy, the advice of Hart and colleagues would seem to require that many diagnostic tools also be abandoned. Finally, classification accuracy is the standard in assessing the kind of “precision” attempted by Hart and colleagues. In most tests of the VRAG, there have been no statistically significant differences between the observed rates and those expected on the basis of the proportions provided as norms (Harris & Rice, in press), especially given known variation predicted by Bayes’ Rule. Thus, classification accuracy has also been successfully replicated. In essence, Hart et al. have attempted, but failed, to gainsay an empirical result with a statistical argument. The notion that it is somehow wrong to base individual decisions on “group data” has been thoroughly refuted (Grove & Meehl, 1996; Quinsey et al., 2006). Consider the example offered by Hart and colleagues themselves – betting on whether a card other than a diamond (probability = .75, 3 to 1 odds) will be drawn from an ordinary deck of playing cards. Hart and colleagues assert that one can have little confidence in winning in a single trial. What do they then advise -- bet on a diamond?! A careful reading of their paper yields only one piece of advice – refuse to bet. Yet consistently betting against a diamond is the winning strategy and all rational gamblers would make that bet. In the context of violence risk assessment over long durations, offenders in the highest two VRAG categories have generally exhibited probabilities of officially detected violent recidivism greater than 75 percent. And the lowest four categories have consistently exhibited rates below 25 percent. Surely forensic clinicians should not refuse to provide this information to those making decisions about violent offenders. Clinical Decisions about One Case What should a forensic clinician do when deciding to release or detain one previously violent forensic patient? Hart et al. 
imply that the clinician should make no release decision, presumably leaving it up to the unaided judgment of others. We disagree. An actuarial tool (such as the VRAG or Static-99) is simply an efficient, available distillation of relevant empirical evidence. An actuarial tool does not afford certainty, of course, but, as Hart and colleagues fully acknowledged, it affords more accuracy than any other known method for making such decisions. In conclusion, the undeniable superiority in accuracy of actuarial systems over all known alternatives means they must be used where available. Except for refusing to make risk-related decisions, Hart and colleagues offer no alternative for actual forensic practice. Taken seriously, their advice is likely to worsen the practice of clinicians who must make decisions about the risk of violent recidivism. Reluctance to make risk-related decisions based on actuarial methods may well have a motivation in addition to (misguided) concerns about accuracy, however. These concerns relate to clinical and philosophical objections to civil commitment for sex offenders in the U.S. and the dangerous severe personality disorders legislation in the UK (Monahan, 2006). Although we have some sympathy here, it is important to understand that the reliability and validity of actuarial instruments are independent of their use in particular schemes for sentencing and managing offenders. Further, if forensic clinicians refuse to make risk-related decisions, decisions will be made by others using less accurate means: less accurate decisions inexorably accumulate in more avoidable harm to victims, more unnecessary restriction of offenders, or both. References Grove, W. M. & Meehl, P. E. (1996) Comparative efficiency of informal (subjective, impressionistic) and formal (mechanical, algorithmic) prediction procedures: The clinical–statistical controversy. Psychology, Public Policy, and Law, 2, 293–323. Harris, G.T. & Rice, M.E. 
(in press) Characterizing the value of actuarial violence risk assessment. Criminal Justice and Behavior. Monahan, J. (2006) A jurisprudence of risk assessment: Forecasting harm among prisoners, predators, and patients. Virginia Law Review, 92, 391-435. Quinsey, V.L., Harris, G.T., Rice, M.E., & Cormier, C.A. (2006) Violent offenders: Appraising and managing risk (Second Edition). Washington, DC: American Psychological Association. Swets, J., Dawes, R., & Monahan, J. (2000) Psychological science can improve diagnostic decisions. Psychological Science in the Public Interest, 1, 1-26. Declaration of Interest: None Grant T. Harris, Ph.D. Director of Research, Mental Health Centre, Penetanguishene; Associate Professor of Psychology (adjunct), Queen's University; Associate Professor of Psychiatry (adjunct), University of Toronto; 705-549-3181, fax: 705-549-3652 Marnie E. Rice, Ph.D., FRSC Professor of Psychiatry and Behavioural Neuroscience, McMaster University; Professor of Psychiatry (adjunct), University of Toronto; Associate Professor of Psychology (adjunct), Queen's University Vernon L. Quinsey, Ph.D. Professor of Psychology, Biology, and Psychiatry, Queen's University |
Stephen D. Hart, Professor Department of Psychology, Simon Fraser University, Christine Michie, and David J. Cooke
hart{at}sfu.ca Stephen D. Hart, et al.
Actuarial risk assessment instruments (ARAIs), constructed using data from known groups, are used to make life-and-death decisions about individuals. How precisely do they estimate risk in individual cases? The 95% CI for proportions, which evaluates the precision of risk estimates for ARAI groups, cannot be used for individual risk estimates unless one makes a very strong assumption of heterogeneity – that ARAIs carve nature at its joints, separating people with perfect accuracy into non-overlapping categories. No-one, not even those who construct ARAIs, makes this assumption. So, we ask again, what is the precision of individual risk estimates made using ARAIs?

Professors Mossman and Sellke criticized us for inadequately defining “individual risk.” They also criticized us for using an ad hoc procedure to estimate the margin of error for individual risk estimates, which they opined served only to “pile nonsense on top of meaninglessness.” We must plead guilty to some of the charges leveled by Mossman and Sellke – indeed, we pled guilty in our paper, acknowledging the conceptual and statistical problems with the approach we used. In our defence, we claimed duress: Because developers used inappropriate statistical methods to construct ARAIs, we could not use appropriate methods to evaluate them. Violent recidivism was measured in the ARAI development samples as a dichotomous, time-dependent outcome, and so the developers ought to have used logistic regression or survival analysis to build models; if they had, one could directly calculate logistic regression or survival scores for individuals and their associated 95% CIs.

But we also plead that these charges are irrelevant to our conclusion. As we discussed, to reject our findings that the margins of error for individual risk estimates are large is to acknowledge that they are either unknown or incalculable.
Regardless, the current state of affairs is unacceptable for those who seek to use these tests in a professionally responsible manner or argue in favor of their legal admissibility. We urge ARAI developers to recalibrate their statistical models in a way that permits direct calculation of individual risk estimates and their precision or to make their data publicly available so others may do so.
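To make concrete what a directly calculated estimate from a logistic regression model might look like, here is a minimal sketch. It assumes a fitted model has already supplied a linear predictor (log-odds) for one person's covariate pattern together with its standard error; both numbers below are hypothetical, and the function names are illustrative. The interval is a Wald interval formed on the log-odds scale and transformed back to a probability, which is one common approach and not necessarily the one the letter's authors have in mind.

```python
import math


def expit(x):
    """Inverse logit: converts log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predicted_risk_with_ci(linear_predictor, standard_error, z=1.96):
    """Predicted probability and Wald 95% CI for one covariate pattern.

    linear_predictor: model log-odds for the person (assumed given).
    standard_error:   standard error of that log-odds value (assumed given).
    """
    risk = expit(linear_predictor)
    low = expit(linear_predictor - z * standard_error)
    high = expit(linear_predictor + z * standard_error)
    return risk, low, high


# Hypothetical model output: log-odds of 0.08 is a risk of about 52%.
risk, low, high = predicted_risk_with_ci(0.08, 0.40)
print(f"predicted risk={risk:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Note that such an interval describes uncertainty in the model's estimated probability for people with that covariate pattern; whether that quantity is what should be called "individual risk" is exactly the point in dispute in this correspondence.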
Stephen D. Hart, Professor Department of Psychology, Simon Fraser University, Christine Michie, and David J. Cooke
hart{at}sfu.ca Stephen D. Hart, et al.
Harris, Rice, and Quinsey (HRQ) claim that:

1. We “misapplied confidence intervals” to actuarial test scores. But we used CIs to evaluate the estimated probability of violence associated with test scores, not the raw scores themselves. The (many) problems with raw scores on actuarial tests are a separate issue.

2. We used “precision” and “accuracy” synonymously. But we did not; we simply recognized the important association between these concepts: The accuracy with which actuarial tests can predict future violence in an individual case depends on the precision of group data. As every research trainee learns, reliability places an upper bound on validity.

3. Their sanguine views about basing individual decisions on group data are supported by Grove and Meehl (1996). But they ought to read Grove and Meehl more carefully: “There is a real problem, not a fallacious objection, about uniqueness versus aggregates in defining what statisticians call the reference class for computing a particular probability in coming to a decision about an individual case” (1996; p. 306). Grove and Meehl’s (lengthy) discussion, which includes the issue of the precision of group estimates, is echoed in our paper.

4. Their belief in the “undeniable superiority of actuarials” is supported by Grove and Meehl (1996). But HRQ continue to confuse group and individual data. Grove and Meehl concluded that actuarial decision making was superior to clinical judgment in about 45% of the studies they reviewed; in the others, clinical judgment was equally accurate or even more accurate. Put differently, the “on average” superiority of actuarials translated into superiority in slightly less than half of the individual comparisons. This is an important trend, obviously, but hardly a sound basis for high-stakes gambling on one outcome. As good scientists, we recommend against betting big on the toss of a single coin.
We strongly support evidence-based practice, but HRQ have confused “evidence based” with “statistically based.” They should recognize that in forensic mental health, as in many areas of life, good practice does not equate to mindless reliance on simplistic statistical algorithms.

References

Grove, W. M., & Meehl, P. E. (1996). Comparative efficiency of informal (subjective, impressionistic) and formal (mechanical, algorithmic) prediction procedures: The clinical-statistical controversy. Psychology, Public Policy, and Law, 2, 293-323.