The British Journal of Psychiatry
Outcomes research in mental health
Systematic review


Background Outcomes research involves the secondary analysis of data collected routinely by clinical services, in order to judge the effectiveness of interventions and policy initiatives. It permits the study of large databases of patients who are representative of ‘real world’ practice. However, there are potential problems with this observational design.

Aims To establish the strengths and limitations of outcomes research when applied in mental health.

Method A systematic review was made of the application of outcomes research in mental health services research.

Results Nine examples of outcomes research in mental health services were found. Those that used insurance claims data have information on large numbers of patients but use surrogate outcomes that are of questionable value to clinicians and patients. Problems arise when attempting to adjust for important confounding variables using routinely collected claims data, making results difficult to interpret.

Conclusions Outcomes research is unlikely to be a quick or cheap means of establishing evidence for the effectiveness of mental health practice and policy.

Randomised controlled trials have generally been accepted as the gold standard when deciding which interventions work in psychiatry (World Health Organization, 1991). Most randomised studies in psychiatry have investigated the effect of drug or psychotherapy interventions in tightly controlled and largely artificial experimental conditions (Hotopf et al, 1997; Thornley & Adams, 1998), while patients, clinicians and other decision-makers need to know how treatments work in the real world and whether they are cost-effective under routine conditions (Wells, 1999). Important questions relating to the organisation and delivery of mental health services are also rarely addressed in randomised trials (Gilbody & Whitty, 2002).

The need for research relating to effectiveness (rather than efficacy) has prompted a number of responses. One has been the call to conduct randomised trials in ‘real world’ settings, using pragmatic designs (Hotopf et al, 1999); another has been to synthesise various data sources using decision analysis (Lilford & Royston, 1998). A response that has been influential in the USA in the past decade involves the analysis of large databases of patient information collected in routine care settings — known as outcomes research (Anonymous, 1989; Ellwood, 1988; Wennberg, 1991).


The ‘outcomes’ movement emerged as a consequence of rapidly escalating costs, acceleration of the introduction of new health technologies and evidence of massive regional variations in the delivery of health care in the USA (Wennberg, 1990; Thier, 1992; Wennberg et al, 1993; Davies & Crombie, 1997). Paul Ellwood, in his 1988 Shattuck lecture (Ellwood, 1988), ushered in the modern outcomes movement and called for the routine collection of outcome measures by clinicians. He proposed that these records should be assimilated in large databases that would form a resource for clinical and health services research. Such data could eventually be used inter alia to compare existing treatments and to evaluate new technologies, thereby avoiding both the expense of clinical trials and the loss of generalisability that results from selective recruitment to conventional efficacy trials.

The Agency for Health Care Policy and Research (AHCPR), now the Agency for Healthcare Research and Quality (AHRQ), was established in the USA under public law in 1989 in order to conduct outcomes research into common medical conditions, with the establishment of patient outcome research teams (PORTs; Wennberg et al, 1993). The research programme was allocated US$6 million in its first year, rising to US$63 million in 1991, with the purpose of using routine outcomes data to determine ‘outcomes, effectiveness and appropriateness of treatments’ (Anderson, 1994). It was decreed by Congress via the General Accounting Office (1992) that new primary research conducted by the PORTs was not to take the form of the traditional randomised controlled trial; rather, it was to be observational in design, utilising the vast amounts of data routinely collected on US patients. This health research policy produced a new breed of health researchers known as database analysts (Anonymous, 1989, 1992), with the motto ‘Happiness is a humongous database’ (Smith, 1997).

Outcomes research differs from traditional observational or quasi-experimental research in a number of ways. The key difference is that outcomes research evaluates competing interventions that are already used in routine care settings, using routine data collected by clinicians or by other agencies (such as insurance companies), whereas quasi-experimental studies implement interventions in one setting or in one group of patients, and compare outcomes with patients who have not been subjected to the intervention (Gilbody & Whitty, 2002). Quasi-experimental studies are therefore more like randomised trials and are considered to be clearly different in their approach and ethos to outcomes research (Aday et al, 1998). The outcomes that are studied in outcomes research are generally those that are already collected as part of routine care, although there is no reason why these cannot be extended in the light of the specific question being asked.

The application of outcomes research to UK mental health services has been advocated in psychotherapy (Barkham et al, 1998; Mellor-Clarke et al, 1999; Guthrie, 2000; Margison et al, 2000). Similarly, the pharmaceutical industry is keen to extend the method in the evaluation of new and relatively expensive drug therapies; for example, the Schizophrenia Outpatient Health Outcomes Study (SOHO), funded by Eli Lilly, aims to recruit European collaborators to collect outcomes from patients with schizophrenia who are in receipt of typical and atypical drugs. Others have urged caution (Sheldon, 1994). The principal concerns that have been expressed about outcomes research are its observational (rather than experimental) design; the poor quality of the data that are used; the inability to adjust sufficiently for case mix and confounding; and the absence of clinically meaningful outcomes in routinely collected data (Iezzoni, 1997).

This article presents the first systematic overview of the application of outcomes research in evaluating competing interventions in mental health, and discusses how this approach might meet the needs of clinicians and decision-makers.


A search was made for all published examples of outcomes research conducted in psychiatric settings or among psychiatric populations, where two or more competing interventions were compared. (See Appendix for search strategies.)

Inclusion criteria

Reports were included if they fulfilled the following criteria:

  1. The research was conducted in a setting that was part of usual care in a healthcare system

  2. The outcome data used were those collected routinely for all patients — either for administrative purposes or as a means of monitoring outcomes in the service being evaluated.

Exclusion criteria

We excluded studies that examined only the costs or processes of illness and health care from routinely collected data, with no linkage to the outcomes of care. For example, primary care prescription databases have been used to conduct research into newer psychotropic drugs (e.g. Donoghue et al, 1996), but since they are not linked to patient-level data and outcomes, they cannot be considered as outcomes research.

Also excluded were quasi-experimental or non-randomised evaluations of new technologies, where an intervention was implemented and outcomes measurement systems established only in the course of its evaluation (Cook & Campbell, 1979). For example, the PRiSM psychosis study (Thornicroft et al, 1998) was a quasi-experimental evaluation of a model of community care for those with severe mental illness, where districts were non-randomly allocated to implement an experimental service, and outcomes were measured under experimental and control conditions as part of the study.

Studies that only examined the relation between patient characteristics and outcome, with no direct comparison between competing treatments or health policy strategies (e.g. Rosenheck et al, 1997), were excluded, as were reports of routine outcomes measurement in practice, with no direct report of comparative service or treatment evaluations based on the data.
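The inclusion and exclusion criteria above can be read as a simple screening rule. The following Python sketch is one hypothetical encoding of that rule; the `Report` class and its field names are illustrative assumptions and are not part of the review protocol, and in practice each flag would rest on a reviewer's judgement of the full report.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """Hypothetical screening record for one candidate report."""
    usual_care_setting: bool            # inclusion criterion 1
    routine_outcome_data: bool          # inclusion criterion 2
    outcomes_linked_to_care: bool       # False if only costs or processes were studied
    implemented_for_evaluation: bool    # True for quasi-experimental evaluations
    compares_interventions: bool        # False if only patient characteristics v. outcome


def is_eligible(report: Report) -> bool:
    """Apply both inclusion criteria, then each exclusion criterion."""
    included = report.usual_care_setting and report.routine_outcome_data
    excluded = (
        not report.outcomes_linked_to_care
        or report.implemented_for_evaluation
        or not report.compares_interventions
    )
    return included and not excluded


# A prescription-database study with no linkage to patient outcomes is excluded.
prescription_study = Report(True, True, False, False, True)
# A comparison of competing treatments using routinely collected outcomes is included.
claims_comparison = Report(True, True, True, False, True)
```

Under these assumptions, `is_eligible(prescription_study)` is false and `is_eligible(claims_comparison)` is true.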

Data extraction

Data were extracted on the following topics: population; clinical or organisational question being asked; setting; sample size and length of follow-up; outcomes studied and their source; adjustment for case mix and confounding; and results.


Despite the widespread advocacy of outcomes research in health care, only nine published examples were found relating to mental health. Most of these were published in the past 3 years, highlighting an increase in the use of the design. The scope, design and analysis of the studies we identified are summarised in Table 1, and their most important characteristics are reviewed below.

Table 1. Examples of outcomes research in mental health

Research questions addressed

Outcomes research has been used broadly in two areas of mental health research.

Evaluation of mental health policy, including aspects of service delivery, organisation and finance

The earliest and perhaps most important example of outcomes research in mental health is the Medical Outcomes Study (MOS) conducted by the RAND Corporation in the USA in the late 1980s (Tarlov et al, 1989; Wells et al, 1989, 1996). The design and objectives of this study were shaped by US health-care policy debates on the role of financing and reimbursement strategies in private care (fee for service v. prepayment) and on the place of speciality (secondary) care.

The researchers justified the use of observational methods in two ways. First, they claimed that the cheaper design and reduced burden on participants could maximise the number and range of collaborators and patients, particularly from non-research settings. Second, they claimed that the specific research questions precluded the use of randomisation, since the very act of randomisation would alter the functioning of existing health-care delivery systems (Wells et al, 1996).

Three other studies looked at health policy and organisation questions, such as the consequences of the withdrawal of mental health benefits from insurance plans (Rosenheck et al, 1999a), the effectiveness of services directed at homeless people (Lam & Rosenheck, 1999) and the difference in outcome between privately and publicly funded health providers (Leslie & Rosenheck, 2000).

Evaluation of new technologies

Four studies (Hong et al, 1998; Melfi et al, 1998; Croghan et al, 1999; Hylan et al, 1999) used an outcomes research design to demonstrate the worth of new antidepressant and antipsychotic medication in routine care settings. One further study (Rosenheck et al, 2000) examined the value of an innovative psychosocial intervention for those with war-related post-traumatic stress disorder (PTSD).

Source and choice of cases and outcomes

Outcomes studies can broadly be divided into those that collect data prospectively on a service-wide level, where the choice of outcomes is decided a priori and is influenced by the research question or population under examination, and those that use existing outcomes data, collected for other purposes.

The MOS is the best-known example of prospective outcomes research. The authors set out to measure patient-centred outcomes, in addition to clinician-rated depressive symptoms, within existing health care services. The enduring legacy of the MOS is that patient-centred measures of health status were developed for the study and eventually evolved into the Short Form 36 (SF36) (Stewart & Ware, 1992), now the most commonly used generic measure of health-related quality of life.

A further study (Rosenheck et al, 2000) measured a number of outcomes, including disease-specific measures relating to the underlying condition (PTSD), measures of social function, health-related quality of life, and service use. This study used a large, existing data-set describing all of the 600 000 patients in receipt of mental health care under the US Veterans Affairs (National Committee on Quality Assurance, 1995). It was supplemented with routinely collected disease-specific patient outcomes measures collected for all patients in receipt of care for PTSD (Rosenheck, 1996).

All the other studies that we identified used existing outcomes already entered on large administrative databases, studying a much more limited range of outcomes. For example, studies examining the value of new antidepressant drugs in routine care settings used a commercially available medical insurance database of linked pharmacy and medical claims data on 750 000 individuals (Melfi et al, 1998; Croghan et al, 1999; Hylan et al, 1999). Cases of depression were identified retrospectively, either from a reimbursement claim for antidepressant medication or by the presence of one of six ICD codes indicative of depression (World Health Organization, 1992). This approach is hampered by the fact that antidepressant drugs are commonly prescribed for a number of conditions other than depression (Streator & Moss, 1997). Similarly, depression is consistently underidentified by clinicians (Jencks, 1985) and mislabelled or under-reported, in part as a consequence of the stigma of mental illness (Rost et al, 1994).
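The retrospective case definition described above (an antidepressant claim or a qualifying diagnostic code) can be sketched as a short rule, which also makes its weakness visible. This Python example is illustrative only: the record structure is hypothetical, and the codes are placeholders because the six ICD codes used by the database studies are not listed here.

```python
from dataclasses import dataclass, field

# Placeholder identifiers only; the actual ICD codes are not given in this review.
DEPRESSION_CODES = frozenset({"DEP_CODE_1", "DEP_CODE_2"})


@dataclass
class ClaimsRecord:
    """Hypothetical patient-level summary of linked pharmacy and medical claims."""
    patient_id: str
    diagnosis_codes: set = field(default_factory=set)
    antidepressant_claim: bool = False


def is_depression_case(record: ClaimsRecord, codes=DEPRESSION_CODES) -> bool:
    """Flag a case if EITHER retrospective criterion is met."""
    has_code = bool(record.diagnosis_codes & codes)
    return record.antidepressant_claim or has_code


# Antidepressant prescribed for another condition: flagged as a case (false positive).
other_indication = ClaimsRecord("p1", {"OTHER_CODE"}, antidepressant_claim=True)
# Depression present but never coded or treated: missed (false negative).
unrecognised = ClaimsRecord("p2", set(), antidepressant_claim=False)
```

Here `other_indication` is counted as a case of depression and `unrecognised` is not, which are the two kinds of misclassification that the paragraph above describes.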

Commercially available administrative databases also hold no direct information about disease severity, such as scores on symptom rating scales. Disease progression, relapse or remission cannot be directly measured, and database studies are forced to use alternatives. For example, Hylan et al (1999) used continuous 6-month claims for refills of prescriptions as a proxy measure of acceptable pharmacotherapy and therefore good outcome, ignoring the fact that patients discontinue medications for a whole host of reasons other than treatment failure.
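To show what a refill-based proxy of this kind measures, the following Python sketch checks whether a series of pharmacy fill dates gives unbroken cover for roughly 6 months. The 30-day supply, the 180-day window and the `grace_days` allowance are assumptions made for illustration; they are not the definition used by Hylan et al (1999).

```python
from datetime import date, timedelta


def continuous_refills(fill_dates, days_supply=30, window_days=180, grace_days=0):
    """Return True if fills give unbroken cover for `window_days` from the first fill.

    Assumes every fill supplies `days_supply` days of medication and that
    unused supply carries over when a refill is collected early.
    """
    if not fill_dates:
        return False
    fills = sorted(fill_dates)
    start = fills[0]
    covered_until = start + timedelta(days=days_supply)
    for fill in fills[1:]:
        if fill > covered_until + timedelta(days=grace_days):
            break  # a gap in supply ends the continuous episode
        covered_until = max(covered_until, fill) + timedelta(days=days_supply)
    return covered_until >= start + timedelta(days=window_days)


# Six fills at 30-day intervals meet the proxy.
six_fills = [date(2000, 1, 1) + timedelta(days=30 * i) for i in range(6)]
# Two fills then nothing: the proxy records a 'poor outcome', whatever the reason.
two_fills = six_fills[:2]
```

The function returns true for `six_fills` and false for `two_fills`. Nothing in the claims record distinguishes a patient who stopped after two fills because of recovery from one who stopped because of side-effects or treatment failure.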

Sample size and length of follow-up

Sample size was generally much greater than that achieved in the traditional randomised trial, with a median sample size of 2678 (range 1034 to 20 814). Studies that recruited subjects prospectively, such as the MOS (Wells et al, 1989), achieved smaller sample sizes (n=1772) than those selecting subjects retrospectively from large, existing data-sets (Croghan et al, 1999; Rosenheck et al, 1999a) (median n=4052). Periods of follow-up were of median 6 months (range 4 to 48 months).

Adjustment for confounding and case mix

All studies made some attempt to describe and adjust for confounding factors, typically using some form of regression analysis or propensity scoring (Rubin, 1997). Authors rarely reported each of the potentially confounding factors that were entered into their analysis — often restricting reports to those that were positive and related to outcome. However, it was clear that the ability of studies to adjust for confounding was determined by the collection or availability of suitable measures. Two studies serve to illustrate the contrast between limited and more complete adjustment for confounding.

The authors of the MOS prospectively measured a broad range of case-mix variables, including disease severity and comorbidity, in addition to traditional demographic characteristics such as age, gender and socio-economic status. This is especially important in the MOS since the type of health care provider is inextricably linked to disease severity, making unadjusted comparisons of outcome impossible to interpret. One of the more unexpected results of the MOS demonstrates the limitation of an observational approach and the need to measure and adjust for case mix and confounding. In unadjusted samples, the receipt of any treatment (antidepressant medication or counselling) was associated with a much worse 2-year outcome than the receipt of no treatment. In analyses that adjusted for baseline health differences, treated and untreated patients had a comparable 2-year outcome. In a subgroup analysis, designed to minimise unmeasured biases by restricting the analysis to those with the most severe depression, treatment was in fact associated with a significantly better 2-year outcome (Wells et al, 1996; Wells, 1999).
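The way confounding by severity can reverse a crude comparison may be easier to follow with numbers. The Python sketch below uses invented counts (they are not data from the MOS) and direct standardisation by severity stratum, which is a simpler stand-in for the regression and propensity-score methods that the included studies actually used.

```python
# Invented counts for illustration only; these are NOT data from the MOS.
# stratum -> arm -> (number of patients, number with a poor outcome)
COUNTS = {
    "severe": {"treated": (80, 40), "untreated": (20, 14)},
    "mild": {"treated": (20, 2), "untreated": (80, 8)},
}


def crude_risk(counts, arm):
    """Risk of a poor outcome in one arm, ignoring severity."""
    patients = sum(counts[s][arm][0] for s in counts)
    events = sum(counts[s][arm][1] for s in counts)
    return events / patients


def stratum_risk(counts, stratum, arm):
    """Risk of a poor outcome in one arm within one severity stratum."""
    patients, events = counts[stratum][arm]
    return events / patients


def standardised_risk(counts, arm):
    """Weight each stratum-specific risk by that stratum's share of all patients."""
    total = sum(n for s in counts for (n, _) in counts[s].values())
    risk = 0.0
    for s in counts:
        stratum_n = sum(n for (n, _) in counts[s].values())
        risk += (stratum_n / total) * stratum_risk(counts, s, arm)
    return risk
```

With these made-up figures the crude risk of a poor outcome is 0.42 for treated and 0.22 for untreated patients, because most treated patients are severely ill. Within the severe stratum the risks are 0.50 and 0.70, and after standardisation they are 0.30 and 0.40, so the apparent harm of treatment disappears once severity is taken into account. The adjustment is only possible because severity was recorded, which is the point of contrast with administrative data.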

In contrast, outcomes studies based on administrative data are much more limited in their ability to measure and adjust for confounding. For example, in retrospective database studies of new antidepressant drugs (Melfi et al, 1998; Hylan et al, 1999) disease severity could not be measured since these data were not directly included in administrative data and could only be crudely inferred from the setting in which care was given (primary v. secondary care).


Despite the enthusiasm with which outcomes research was adopted and funded in the USA, by the 1990s its value was being called into question. The US Office of Health Technology Assessment (1994) offered a stinging appraisal: ‘Contrary to the expectations expressed in the legislation establishing the AHCPR... administrative databases have generally not proved useful in answering questions about the comparative effectiveness of alternative medical treatments.’ Clearly, the superficially appealing opportunity to generate large-scale studies from readily available and existing data sources should be approached with caution. This review highlights both the strengths and the limitations of outcomes research as a method for evaluating mental health services.

Strengths of outcomes research

The criticism is often made that randomised trials are undermined by the fact that the participants form a highly selected and homogeneous group, and their health care and follow-up are different from that received by the majority of patients (Anonymous, 1994). The consequence is that it is not always possible to apply the results in clinical practice — in other words, trials lack external validity (Naylor, 1995).

One potential advantage of outcomes research is that observational data are routinely collected for all patients and the results can therefore be applied more generally. Further, data are generated in routine health-care services, rather than in artificially constructed trials. Lastly, outcomes research might be able to deliver answers to some questions quickly, cheaply and with greater statistical power, and without the need to seek ethical approval and individual patient consent, compared with the time-consuming and costly randomised trial. This review suggests that outcomes research in mental health has indeed realised these advantages — incorporating large numbers of subjects from real-life clinical populations and following them up for clinically meaningful periods of time.

Weaknesses of outcomes research

Ellwood's original vision of outcomes research required that a rich and clinically meaningful set of outcomes would be collected for all patients during their routine care (Ellwood, 1988). However, the feasibility and cost of such data collection has meant that the building blocks of much outcomes research (with notable exceptions) have been data that are collected as part of the administrative process (Iezzoni, 1997). These administrative data (produced by federal health providers, state governments and private insurers) contain the minimum amount of information required to fulfil an administrative function, particularly billing. They generally include little more than routine demographic data, ICD-9 diagnostic codes, details of interventions received during a hospital episode, length of stay and mortality during a hospital episode. The fundamental problem with research using these data is that the outcomes available are generally not those that we would like to study. Research becomes driven by the availability of data rather than by the need to answer specific questions, as acknowledged by one outcomes researcher: ‘I utilise data that are available. I do not start with “what is the problem and what is the outcome?” I say, “given these data, what can I do with them?”’ (Blumberg, 1991).

The other major problem with outcomes research, as with all observational research, is the problem of confounding and selection bias (Cook & Campbell, 1979; Iezzoni, 1997). The treatment that a patient receives will often be determined by a number of factors that are related to outcome, such as disease severity. Thus patients will differ in many ways other than the treatment they receive, and it is therefore difficult to attribute any differences in outcome to the treatment itself (Green & Byar, 1984).

Our review suggests that, in mental health, large-scale studies using ‘humongous databases’ are largely achieved at the expense of clinically meaningful outcomes and with limited opportunities to adjust for confounding. Only two studies stand out as having collected a broad range of clinically important outcomes and case-mix variables, reflecting not just disease severity but the facets of service use and health-related quality of life: the MOS (Wells et al, 1989) and Rosenheck's study of PTSD (Rosenheck et al, 1999b).

Can outcomes research ever be useful in the UK?

Professor Nick Black has recently called for the establishment of large-scale, high-quality clinical databases across all disciplines in the UK (Black, 1999). The most ambitious example of this work in the UK has been in the field of intensive care (Rowan, 1994). According to Black, such databases need not be seen as an alternative to the randomised trial, but rather as a complement. The attractions for researchers include the possibility of generating large samples from many participating centres, and of including clinically important subgroups of patients who might be excluded from traditional trials. Outcomes research can also be used to promote rather than replace randomised trials in a number of ways. First, raising the level of uncertainty among clinicians as to the effectiveness of established interventions might increase clinicians' likelihood of participating in a randomised trial. Second, it could provide a permanent infrastructure for mounting multi-centre trials. Finally, the adoption of such databases means that research would no longer be the preserve of a minority of clinicians working in specialist centres, thus enhancing the generalisability of the results.

How feasible are such developments in mental health research in the UK?

The absence of a centralised administrative data-collection system in the UK has meant that the building blocks of outcomes research have never developed to the extent that they have in the USA. Initiatives to ensure that uniform outcomes are collected for all patients, such as the Health of the Nation Outcome Scales (Wing, 1994), have been proposed but have not so far been adopted in routine practice (Slade et al, 1999). Consequently, the adoption of routine outcomes monitoring will entail substantial effort.

Research initiatives are under way; for example, the Centre for Outcomes Research and Effectiveness (CORE) has been established under the auspices of the British Psychological Society (Clifford, 1998) in order to generate ‘practice-based evidence’ of effectiveness framed within routine services (Margison et al, 2000). At this juncture, it would be timely to learn from the examples of outcomes research in the USA, and to recognise the limitations and potential of the approach.

Rosenheck et al (1999b), who provided one of the more rigorous examples of outcomes research, outlined several ingredients of a successful clinical database, capable of producing rigorous and informative research. Outcomes databases should:

  1. include large numbers of subjects

  2. use standardised instruments that are appropriate for the clinical condition being treated

  3. measure outcomes in multiple relevant domains

  4. include extensive data in addition to outcomes measures, in order to support matching

  5. collect data at standardised intervals after a sentinel event such as entry to or discharge from hospital

  6. take aggressive steps to achieve the highest possible follow-up rates.

Data should also be collected prospectively if they are to meet these aims.

Such databases are going to require substantial time, effort and expense to establish, making outcomes research far from the quick and cheap research option that was envisaged. For example, the whole MOS cost US$12 million, and the depression component cost about US$4 million (Wells et al, 1996). Outcomes research requires resolution of the practical and ethical problems of using clinical data for study purposes, as highlighted in recent debates about the Data Protection Act, the European Human Rights Act and the Health and Social Care Bill (Al-Shahi & Warlow, 2000; Medical Research Council, 2000; Anderson, 2001; Kmietowicz, 2001).

The pharmaceutical industry is especially keen to use outcomes research to examine the effectiveness of its products. This review highlights the fact that, so far, outcomes studies conducted by the pharmaceutical industry have been generally of poor quality and do not adhere to the sensible recommendations outlined by Rosenheck et al (1999b). The use of this method has clear advantages for the pharmaceutical industry — particularly in terms of cost. In conducting such research, the industry can claim that expensive (pragmatic) randomised trials are no longer needed in order to examine clinical and economic effectiveness in routine care settings; neither will they have to provide and dispense the drugs for the many thousands of patients who are included in these studies. Informed consent and ethical approval may no longer be required, since treatment is as received, as part of usual care, and outcomes are those that are collected anyway. Large-scale outcomes studies that are currently in progress — such as the SOHO Study — will need to demonstrate that they are methodologically robust and that their results are believable.

Mental health researchers must give clear thought as to how outcomes databases should be constructed, how resources might be put in place, and to what extent informed consent is required for research conducted using these data. Outcomes research should not be seen as an alternative to randomised controlled trials, but rather as a complement. Clinicians do not generally like collecting standardised data for each and every patient (Walter et al, 1996a, b; Slade et al, 1999). It would be unfortunate if outcomes research were simply to be regarded as a quick and flawed solution to the many political and clinical problems in mental health.

Clinical Implications and Limitations


  • Robust evidence is needed of the effectiveness of new and existing treatments, interventions and policy initiatives in mental health.

  • Randomised trials have formed the ‘gold standard’ of this evidence but are subject to many limitations.

  • Outcomes research has the potential to provide ‘real world’ evidence of clinical and economic effectiveness, relatively quickly and cheaply, using routinely collected data from clinical services.


  • Outcomes research uses an observational design and is subject to many limitations — principally bias and confounding.

  • The quality of the data upon which outcomes research is based is often poor.

  • Successful outcomes research depends upon the routine collection of diverse and clinically meaningful outcomes, which requires substantial effort and cost.


Search terms

The following bibliographic databases were searched: Medline (1966-2000); EMBASE (1981-2000); Cinahl (1982-2000); PsycLit (to 2000). In addition, we hand-searched a number of key journals and scrutinised reference lists for additional studies; we contacted key authors in the field. Our search included the following terms.



The authors are grateful to Kate Misso for performing all literature searches.


  • See editorial, pp. 1–2, this issue.

  • Received February 14, 2001.
  • Revision received June 14, 2001.
  • Accepted June 14, 2001.

