The British Journal of Psychiatry
Towards a standardised brief outcome measure: psychometric properties and utility of the CORE—OM


Background An acceptable, standardised outcome measure to assess efficacy and effectiveness is needed across multiple disciplines offering psychological therapies.

Aims To present psychometric data on reliability, validity and sensitivity to change for the CORE—OM (Clinical Outcomes in Routine Evaluation — Outcome Measure).

Method A 34-item self-report instrument was developed, with domains of subjective well-being, symptoms, function and risk. Analyses include internal reliability, test—retest reliability, socio-demographic differences, exploratory principal-component analysis, correlations with other instruments, differences between clinical and non-clinical samples and assessment of change within a clinical group.

Results Internal and test—retest reliabilities were good (0.75-0.95), as was convergent validity with seven other instruments; there were large differences between clinical and non-clinical samples and good sensitivity to change.

Conclusions The CORE—OM is a reliable and valid instrument with good sensitivity to change. It is acceptable in a wide range of practice settings.

Thornicroft & Slade (2000) said:

“Can mental health outcome measures be developed which meet the following three criteria: (1) standardised, (2) acceptable to clinicians, and (3) feasible for ongoing routine use? We shall argue that the answers at present are ‘yes’, ‘perhaps’, and ‘not known’.”

For psychotherapies, we argue that the Clinical Outcomes in Routine Evaluation — Outcome Measure (CORE—OM) described below can answer ‘yes’, ‘largely’ and ‘generally’. There have been previous initiatives to create a core battery to assess change in psychotherapy (Waskow, 1975; Strupp et al, 1997). We have analysed some reasons why these did not achieve wide uptake (Barkham et al, 1998). In the UK, the need for a core battery and routine data collection has been acknowledged, as has the need for routine effectiveness and efficacy evidence (Department of Health, 1996, 1999; Roth & Fonagy, 1996). Despite a multitude of measures (Froyd et al, 1996), there is still no single, pantheoretical measure. Such a measure would need to measure the ‘core’ domains of problems, meet Thornicroft & Slade's desiderata and be ‘copyleft’ (i.e. the copyright holders license it for use without royalty charges subject only to the requirement that others do not change it or make a profit out of it).

Development of the new outcome measure

This paper assesses the self-report CORE—OM. Its rationale and development have been described elsewhere (Barkham et al, 1998; Evans et al, 2000). A team, led by the authors, reviewed current psychological measures and produced a measure refined in two waves of pilot work involving quantitative analyses and qualitative feedback from a wide group of service users and clinicians. This paper reports the psychometric properties and utility of the final measure.

The measure

The measure fits on two sides of A4 and includes 34 simply worded items, all answered on the same five-point scale ranging from ‘not at all’ to ‘most or all the time’. It can be hand-scored or scanned by computer. The items cover four domains: subjective well-being (four items), problems/symptoms (twelve items), life functioning (twelve items) and risk (to self and to others; six items) (see Table 1). Some items are tuned to lower and some to higher intensity of problems in order to increase scoring range and sensitivity to change; 25% of the items are ‘positively’ framed and reverse-scored. Overall, the measure is problem scored (i.e. higher scores indicate more problems). Scores are reported as means across completed items, which gives a ‘pro-rated’ score when some responses are missing. For example, if two items have not been responded to, the total score is divided by 32 (see below). Pro-rating an overall score is not recommended if more than three items have been missed; nor should a domain score be pro-rated if more than one item is missing from that domain.

Table 1

Domains and items of the Clinical Outcomes in Routine Evaluation — Outcome Measure
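The pro-rating rule described above can be sketched in code. This is a minimal illustration, assuming items are coded 0-4 with `None` for an omitted response; it is not the official CORE-OM scoring software.

```python
# Sketch of the pro-rating rule: scores are means across completed
# items, with limits on how many omissions are tolerated.

def prorated_mean(responses, max_missing):
    """Mean item score, pro-rated over answered items.

    Returns None if more than `max_missing` items were omitted.
    """
    answered = [r for r in responses if r is not None]
    missing = len(responses) - len(answered)
    if missing > max_missing or not answered:
        return None
    return sum(answered) / len(answered)

def score_core_om(responses):
    """Overall score: pro-rate only if at most 3 of the 34 items are missing."""
    assert len(responses) == 34
    return prorated_mean(responses, max_missing=3)

def score_domain(responses):
    """Domain score: pro-rate only if at most 1 item is missing."""
    return prorated_mean(responses, max_missing=1)
```

With two items omitted, for example, the sum of the remaining responses is divided by 32, exactly as described in the text.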

We recommend that the measure be used before therapy and at its end. Repeating it during longer therapies may be useful, and follow-up is highly desirable, if not often sought in current clinical practice. Different therapies and services will address different questions and create very different ‘best’ usage. A research-oriented example is given by Barkham et al (2001). We know of services offering very brief therapies that find it useful as an overall nomothetic assessment completed at the initial and final sessions, whereas other services have posted the measure at referral and repeated it at assessment or the first session. Services with waiting times between assessment and therapy have also found it informative to check stability over that interval. Reliable and clinically significant change appraisal (see below) supports case-level audit of successes and failures.


The analyses assessed whether the measure is usable, reliable (i.e. sufficiently uncontaminated by random error) and valid (i.e. apparently measuring what it is intended to measure). This paper reports specifically on usability, internal reliability, test—retest reliability, convergent validity in relation to other measures, and validity as shown by large effect sizes for the clinical/non-clinical comparison and for change but only small effect sizes for socio-demographic variables.

The data

Results are reported on data from two main samples: a non-clinical sample and a clinical sample. Samples are described in Table 2.

Table 2

Characteristics of non-clinical and clinical samples

The clinical data came from 23 sites that had expressed an interest in such a measure in our initial survey of purchasers and providers (Evans et al, 2000) or were known through the UK Society for Psychotherapy Research's ‘Northern Practice Research Network’. The majority of sites were within the National Health Service (NHS), but they also included three university student counselling services and a staff support service. Two services were focused on primary care, whereas others had wider spans of referrals. Leadership and membership varied, including medical psychotherapists, clinical psychologists, counselling psychologists, counsellors and psychotherapists. Theoretical orientation also varied, the majority describing themselves as ‘eclectic’ and the remainder asserting behavioural, cognitive—behavioural or psychodynamic orientations. Minimal patient demographic information was collected, but non-completion rates were not assessed because most services said that they were not logistically ready for this. Data used were the first available, whether from pre-treatment or from the first treatment session.

One non-clinical sample was from a British university and included both undergraduate and postgraduate students. To complement this in relation to the ‘general population’, a ‘sample of convenience’ was sought from non-clinical workers, relatives and friends of the clinicians in the CORE battery team and in the major collaborating sites. Differences between the student and non-student samples were generally minimal and all results reported here are pooled across both.


The CORE—OM data were scanned by computer using the FORMIC data-capturing system (Formic Design and Automatic Data Capture, 1996). Most analyses were conducted in SPSS for Windows, version 8.0.2. Non-parametric tests were used because statistical power was high and distributions generally differed significantly from Gaussian. All inferential tests of differences were two-tailed against P<0.05. The large sample sizes gave high statistical power so that significance would be found for small effects, thus effect sizes and confidence intervals (Gardner & Altman, 1986) generally are reported. Most were produced by SPSS but confidence intervals for Spearman correlations were calculated using Confidence Interval Analysis (CIA; Gardner et al, 1989) and those for Cronbach's alpha were calculated using an SAS/IML (SAS Institute, 1990) program written by one of us (C.E.), implementing the methods of Feldt et al (1987).
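For readers without access to the CIA software, a confidence interval for a Spearman correlation can be approximated via Fisher's z transform. The sketch below uses the Fieller-Hartley-Pearson variance 1.06/(n-3) for the transformed coefficient; this is a common large-sample approximation and may not match the exact method implemented in CIA.

```python
import math

def spearman_ci(rho, n, z=1.96):
    """Approximate 95% CI for a Spearman correlation.

    Transforms rho with Fisher's z, applies the Fieller-Hartley-Pearson
    standard error sqrt(1.06 / (n - 3)), and back-transforms the limits.
    """
    zr = math.atanh(rho)
    se = math.sqrt(1.06 / (n - 3))
    return math.tanh(zr - z * se), math.tanh(zr + z * se)
```

For example, `spearman_ci(0.8, 43)` returns an interval a little wider than the equivalent Pearson interval would be, reflecting the inflated variance term.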



The first fundamental requirement of any measure is that respondents complete it. Of the total, 91% of the non-clinical and 80% of the clinical samples returned complete data. The difference is statistically significant (P<0.0005). The numbers omitting few enough items to allow pro-rating showed a different pattern, with 1084 (98%) of the non-clinical and 863 (97%) of the clinical samples retaining sufficient items to allow scoring (P=0.15).

The item that was most often incomplete was no. 19 (‘I have felt warmth and affection for someone’) in both samples (2.5% incomplete in the non-clinical and 3.8% incomplete in the clinical sample). The overall omission rate was 1.7%. If this applied for all items, then the numbers omitted would be distributed binomially. Forty-three or more omitted would be a significantly (P<0.05) elevated number. Items exceeding this were nos 21 and 34 (43 omissions), nos 20 and 30 (44), no. 32 (49) and no. 19 (61). A significantly low number of omissions would be 24 or fewer. These items were no. 3 (23 omissions), no. 2 (20), no. 14 (18) and no. 5 (16). There is heterogeneity in omission of items, with some suggestion that later items were omitted more frequently, but there is no link with domain.
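The binomial argument above can be reproduced numerically. This sketch assumes a per-item denominator of roughly 2,000 respondents (the exact figure is not stated for this analysis) and tests each tail at P<0.05; the probability mass function is built by recurrence because direct binomial coefficients would overflow floating point at this n.

```python
def omission_cutpoints(n, p, tail=0.05):
    """Cut-points for per-item omission counts under a binomial model.

    Returns (low, high): the largest count k with P(X <= k) < tail and
    the smallest count k with P(X >= k) < tail, for X ~ Binomial(n, p).
    """
    # Build the pmf by the multiplicative recurrence to stay in float range.
    pmf = [0.0] * (n + 1)
    pmf[0] = (1 - p) ** n
    for k in range(n):
        pmf[k + 1] = pmf[k] * (n - k) / (k + 1) * p / (1 - p)
    cum, low = 0.0, -1
    for k in range(n + 1):
        cum += pmf[k]
        if cum >= tail:
            break
        low = k
    cum, high = 0.0, n + 1
    for k in range(n, -1, -1):
        cum += pmf[k]
        if cum >= tail:
            break
        high = k
    return low, high
```

With n near 2,000 and p = 0.017 this yields cut-points close to the 24-or-fewer and 43-or-more figures reported above, though the exact values depend on the true denominator.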

Internal consistency

Internal reliability is indexed most often by coefficient α (Cronbach, 1951), which indicates the proportion of the variance that is covariant between items. Low values indicate that the items do not tap a nomothetic dimension of individual differences. Very high values (near unity) indicate that too many items are being used or that items are semantically equivalent (i.e. not adding new information to each other). All domains show α of >0.75 and <0.95 (i.e. appropriate internal reliability; Table 3). Confidence intervals show that the values are estimated very precisely by the large sample sizes. Despite this, only the problem domain showed statistically significantly lower reliability in the clinical than in the non-clinical sample. Even this difference of 2% (88% v. 90%) in the proportion of covariance is not problematic, although its origins may prove to be of theoretical interest.

Table 3

Coefficient α (95% CI) denoting internal consistency for non-clinical and clinical samples
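Coefficient α itself is straightforward to compute. A minimal, standard-library sketch for complete data (no missing items), with variances using the conventional unbiased (n-1) denominator:

```python
def variance(xs):
    """Sample variance with the unbiased (n - 1) denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(rows):
    """Coefficient alpha (Cronbach, 1951).

    rows: one list of item scores per respondent, with no missing values.
    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals).
    """
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)
```

The confidence intervals in Table 3 come from the Feldt et al (1987) sampling theory for α, which is a separate calculation not shown here.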

Test—retest stability

Very marked score changes over a short period of time would suggest problems. Of 55 students approached, 43 returned complete data from both occasions. Test—retest correlations were highest within domains (see Table 4). The stability of the risk domain was lowest at 0.64, which is unsurprising in view of the brevity of that domain and the situational, reactive nature of its items. The stabilities of 0.87-0.91 for all other scores are excellent. The second part of Table 4 gives the mean change, 95% confidence interval and significance (Wilcoxon test), showing small but statistically significant falls on some scores.

Table 4

Test—retest stability in a non-clinical student sample (n=43)

Convergent validity

As noted, the measure is designed to tap differences between clients and change during therapy across its domains. Failure to correlate with appropriate specific measures would therefore suggest invalidity. The correlations (Table 5) are highest against conceptually close measures, showing convergent validity and that scores do not simply reflect common response sets.

Table 5

Correlations with referential measures in clinical samples

The only exception was for the new version of the Beck Depression Inventory (BDI—II; Beck et al, 1996), where the small n gives only low precision (95% CI for ρ=0.51-0.87).

In one site — a university counselling service — clinician ratings of ‘significant risk’ were recorded and 7/40 clients were considered to be at risk. Their risk scores differed strongly and statistically significantly from those of the other 33 clients (means 1.1 v. 0.3; 95% CI for the difference 0.41-1.2, P<0.0005), with no statistically significant differences on the other domains. This supports the allocation of risk items to this domain, as does the high correlation with the severe depression scale of the General Health Questionnaire (GHQ).

Differences between clinical and non-clinical samples

The main validity requirement of an outcome measure is that it should discriminate between the clinical populations for which it has been designed and non-clinical populations. Table 6 shows that these differences were large and highly statistically significant on all domains. Confidence intervals are narrow, showing that the differences are estimated precisely and are large: more than one point on the 0-4 scale for all domain scores other than risk.

Table 6

Means and standard deviations for clinical and non-clinical samples
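The size of the clinical/non-clinical differences can be expressed with standard formulas. The sketch below uses the pooled-SD standardised difference (Cohen's d) and a large-sample normal approximation for the confidence interval of a mean difference; neither is claimed to be the exact method behind Table 6, and the inputs are whatever summary statistics a reader takes from the table.

```python
import math

def mean_diff_ci(m1, s1, n1, m2, s2, n2, z=1.96):
    """Approximate 95% CI for the difference between two sample means,
    using the normal approximation appropriate for large samples."""
    diff = m1 - m2
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return diff - z * se, diff + z * se

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardised mean difference using the pooled standard deviation."""
    sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (m1 - m2) / sp
```

Large samples make the standard errors small, which is why the intervals in Table 6 are narrow even though the raw differences exceed a full scale point.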

The boxplot in Fig. 1 shows a few patients in the clinical sample scoring zero and a very few individuals (outliers) in the non-clinical sample scoring very highly. However, neither box (each covering the middle 50% of scores in its group) overlaps the median line bisecting the box of the other sample.

Fig. 1

Boxplot of mean item score for all items for clinical and non-clinical samples. The box encloses the interquartile range (IQR) (i.e. encloses the middle 50% of scores) and the line through the box marks the sample median. ‘Whiskers’ extend below both boxes to the minimum scores and, for the clinical sample, up to its maximum. The non-clinical sample shows a number of outliers (1.5 × to 3 × the IQR above the 75th centile) and extremes (over 3 × the IQR), illustrating the presence of very few, very high scorers in the non-clinical sample.
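The outlier and extreme categories in the caption follow the usual boxplot fences. A sketch of that classification is given below; quartile conventions differ between statistical packages, so counts may differ slightly from SPSS output.

```python
# Classify scores as 'outliers' (1.5x to 3x the IQR beyond a quartile)
# or 'extremes' (more than 3x the IQR beyond a quartile).

def quartiles(xs):
    """Lower and upper quartiles by linear interpolation of sorted data."""
    s = sorted(xs)
    def q(p):
        idx = p * (len(s) - 1)
        lo = int(idx)
        frac = idx - lo
        return s[lo] if frac == 0 else s[lo] * (1 - frac) + s[lo + 1] * frac
    return q(0.25), q(0.75)

def classify(xs):
    """Return (outliers, extremes) relative to the Tukey-style fences."""
    q1, q3 = quartiles(xs)
    iqr = q3 - q1
    out, ext = [], []
    for x in xs:
        if x > q3 + 3 * iqr or x < q1 - 3 * iqr:
            ext.append(x)
        elif x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr:
            out.append(x)
    return out, ext
```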

Ethnicity, age and gender differences

Students were asked whether English was their first language. Because omission of items might reflect linguistic problems with the measure, the number of omitted items was related to the first language. This showed that the 50 respondents who said that their first language was not English omitted an average of 2.5 items, as opposed to 0.35 by the other 607 who answered the language question in that survey. This is statistically significant (P<0.0005) but relatively few items were dropped by either group. Internal consistency was similar for the samples, with no statistically significant differences, suggesting that answering in a second language in these samples did not impair internal consistency.

Analysis showed only small correlations between scores and age. There was a statistically significant but negligible increase in symptom scores with age (ρ=0.076, P=0.014) in the non-clinical sample, and small reductions in risk (ρ=-0.15, P<0.0005) and function scores (ρ=-0.10, P=0.004) with age in the clinical sample.

Many psychological measures show gender differences and much has been written on whether these represent response biases. In the design of the CORE—OM, we sought to minimise gender bias but had no belief in a ‘gender-free’ instrument. The results (Table 7) show moderate and statistically significant gender differences in the non-clinical samples for all domain scores except functioning. The differences in the clinical samples were smaller, with statistically significant differences on well-being and, narrowly, on risk. Clearly, gender should be taken into account when relating individual scores to referential data, but the effects of gender are small compared with the effects of clinical v. non-clinical status.

Table 7

Gender differences in scores for clinical and non-clinical samples

Correlations between domain scores

Given the interrelationship between clinical domains, scores were expected to be positively correlated. The correlations in Table 8 show that the risk score correlates less strongly with the other scores, more so in the non-clinical than in the clinical sample. The three other scores correlate highly with each other.

Table 8

Correlations (Spearman's ρ) between domain scores for clinical and non-clinical samples

Exploratory principal-component analysis

Principal-component analyses were conducted separately for the clinical and non-clinical samples. The scree plot for the non-clinical sample is shown in Fig. 2. It shows the very large proportion of the variance in the first component (38%) and the suggestion of an ‘elbow’ (i.e. a flatter ‘scree’; Cattell, 1966) after three components.

Fig. 2

Scree plot for non-clinical sample (n=1009).

The pattern matrix after oblique rotation (Table 9) shows a clear separation of the items into a negatively worded group, a group made up largely of the risk items and a positively worded group. Figure 3 presents the scree plot for the clinical sample. Again, the pattern matrix suggests three components: a problem one, a risk one and a more positively worded one. However, the solution seems to differ in fine detail from that for the non-clinical sample (Table 10).

Table 9

Pattern matrix for non-clinical sample

Fig. 3

Scree plot for clinical sample (n=713).

Table 10

Pattern matrix for clinical sample
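The scree-plot step of the analyses above can be sketched as follows: extract the eigenvalues of the item correlation matrix and report the proportion of variance each component explains (the quantity plotted in a scree plot). Rotation (the oblique rotation behind Tables 9 and 10) is a separate step not shown here, and the synthetic one-factor data in the demonstration are purely illustrative.

```python
import numpy as np

def scree(data):
    """data: (respondents x items) array.

    Returns the eigenvalues of the item correlation matrix, sorted
    descending, and the proportion of total variance each explains.
    """
    corr = np.corrcoef(data, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh is ascending; reverse
    return eigvals, eigvals / eigvals.sum()

# Demonstration on synthetic data: five items driven by one latent factor,
# so the first component should dominate, as in the paper's plots.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
items = latent + 0.5 * rng.normal(size=(200, 5))
eigvals, explained = scree(items)
```

An ‘elbow’ is then read off where successive eigenvalues flatten out, which is a visual judgement rather than a formal test.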

Sensitivity to change

To test for possible differences relating to the nature of problems and to differences in typical numbers of sessions offered, change was considered in relation to three settings: counselling in primary care, student counselling and a ‘clinical’ group comprising NHS psychotherapy and/or counselling services (i.e. the remainder of the overall sample). The results (Table 11) show substantial and highly statistically significant improvements on all scores for all three settings.

Table 11

Grouped change data

Reliable and clinically significant change (RCSC)

The methods of classifying change as ‘reliable’ and as ‘clinically significant’ address individual change rather than group mean change. Reliable change is change of a magnitude that would occur in only 5% of cases if change were due simply to unreliability of measurement. Clinically significant change is change that moves a person from a score more characteristic of a clinical population to a score more characteristic of a non-clinical population (Jacobson & Truax, 1991). RCSC complements and extends grouped analyses (Evans et al, 1998). The referential data reported here give the cut-points shown in Table 12.

Table 12

Male and female cut-off scores between clinical and non-clinical populations

Using those cut-points and the coefficient α value of 0.94 to calculate the reliable change criterion allows the change categories to be counted. The three possible categories of reliability of change are: change small enough to fall within the range that would be seen by chance alone given the reliability (‘not reliable’); reliable improvement; and reliable deterioration. The four categories of clinical significance of change are: stayed in the clinical range; stayed in the non-clinical range; changed from the clinical to the non-clinical range (‘clinically significant improvement’); and changed from the non-clinical to the clinical range (‘clinically significant deterioration’). Together, these give the 12 theoretically possible change categories seen in Table 13. Clearly, the ideal outcome is the one shown in bold: reliable and clinically significant improvement. A few patients will score too low on entry into therapy to be able to show clinically significant improvement, whereas some will score highly on entry and improve reliably but not end below the cut-point that would place them in the clinically significantly improved range.

Table 13

Reliable and clinically significant change
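The reliable change criterion and the clinical-significance cut-point follow the standard Jacobson & Truax (1991) formulas. A sketch is given below; the means and SDs used in the worked check are illustrative, not the paper's referential values.

```python
import math

def reliable_change_criterion(sd, reliability, z=1.96):
    """Smallest change not attributable to measurement error alone
    (two-tailed, 5%): z * SD * sqrt(2) * sqrt(1 - reliability)."""
    return z * sd * math.sqrt(2) * math.sqrt(1 - reliability)

def cutoff_c(mean_clin, sd_clin, mean_nonclin, sd_nonclin):
    """Jacobson & Truax cut-point 'c': the score equidistant, in SD
    units, from the clinical and non-clinical means."""
    return (sd_clin * mean_nonclin + sd_nonclin * mean_clin) / (sd_clin + sd_nonclin)

def classify_change(pre, post, crit, cut):
    """Cross the reliability category with the clinical-significance
    category, giving one of the 12 cells of Table 13."""
    if pre - post >= crit:
        reliable = 'reliable improvement'
    elif post - pre >= crit:
        reliable = 'reliable deterioration'
    else:
        reliable = 'no reliable change'
    if pre >= cut and post < cut:
        clin = 'clinically significant improvement'
    elif pre < cut and post >= cut:
        clin = 'clinically significant deterioration'
    elif pre >= cut:
        clin = 'stayed clinical'
    else:
        clin = 'stayed non-clinical'
    return reliable, clin
```

With α = 0.94 the criterion is small relative to the clinical/non-clinical difference, which is why most treated patients can register reliable improvement.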

The majority of patients showed reliable improvement in all three groups. The clinical significance results were less impressive: only a slight majority showed clinically significant improvement, except in the primary care sample, which showed no clinically significant change. Very few showed either clinically significant or reliable deterioration. However, identifying the 19 people of the 281 (7%) who appear to have shown reliable, or clinically significant, deterioration would support case-level audit.

Without knowing more about the clinical services or about the non-response rates, it is premature to interpret either the grouped or the individually categorised change data comparatively. However, they underline that the measure is sensitive to, and can usefully categorise, change in all three settings.


Discussion

This paper has presented the psychometric properties of a ‘core’ outcome measure developed through active interaction between researchers, clinician-researchers and clinicians. Our discussion focuses on three questions that are central to our original objectives: is the CORE-OM reliable, valid and sensitive to change; is it acceptable and accessible; and does it have wide utility?

Is the CORE-OM reliable, valid and sensitive to change?

The results presented are satisfactory. The CORE-OM and its domain scores show excellent internal consistency in large clinical and non-clinical samples. In addition, it has high 1-week test—retest reliability in a small sample of students. Convergent validation against a battery of existing measures and clinician ratings of risk is good. Gender differences are statistically significant in the non-clinical samples but less so in the clinical samples. Although sufficiently different to require gender-specific referential data, the differences are small enough in relation to the clinical/non-clinical differences to suggest that the measure is not heavily gender-biased. On this evidence, the CORE-OM meets the required standard for acceptable validity and reliability.

The very strong discrimination between the samples suggests that the measure is well tuned to the clinical/non-clinical distinction. The correlations between the domain scores and the principal-component analysis suggest that item responses across domains are highly correlated, in both clinical and non-clinical samples. A first component accounts for a large proportion of the variance, but a three-component structure separating problems, risk items and positively scored items may be worthy of further exploration, particularly in relation to the phase model of change in psychotherapy (Howard et al, 1993). Change data from counselling in primary care, student counselling and NHS psychotherapies all suggest that the CORE-OM is sensitive to change and capable of categorising change using the methods of ‘reliable and clinically significant change’.

Is the CORE-OM acceptable and accessible?

The rates of omitted items are such that most scores can be pro-rated and the measure has good acceptability in clinical and non-clinical use. Non-completion rates were not assessed, so the results can be generalised only to the population of clients who are currently willing to complete such measures on the minimal encouragement available when a research project is spliced onto normal clinical practice. Work is now in progress with some sites to gain regular and detailed non-completion information and to explore residual practitioner and patient reluctance.

The non-clinical data-sets provide referential data on score distributions in British populations that are not available for many symptom measures in routine use. Further work is in progress to develop translations into other languages. In addition, two parallel, 18-item, single-sided short forms are available for services wishing to track progress session by session. Data on them will be reported separately.

Does the CORE-OM have wide utility?

The collaboration between practitioners and researchers has produced a reliable, valid and user-friendly core outcome measure that has clinical utility in a range of different settings. It has achieved its design aims. However, the aims went beyond creating ‘another measure’. The first intention was that the CORE-OM should constitute a ‘core’: something onto which other measures can be added (Barkham et al, 1998), as shown in Fig. 4. The CORE-OM thus constitutes a common, available measure with which to pursue the broader goals of measuring efficacy and effectiveness in psychological treatments. A report on its usage in one large service is given by Barkham et al (2001). However, to return to Thornicroft & Slade (2000) for a more general overview:

Fig. 4

Relationship between the CORE-OM and other outcome measures.

“Can mental health outcome measures be developed which meet the following three criteria: (1) standardised, (2) acceptable to clinicians, and (3) feasible for ongoing routine use? ... implementing the routine use of outcome measures is a complex task involving the characteristics of the scales, the motivation and training of staff, and the wider clinical and organisational environment. ... When assessed using these criteria [applicability, acceptability and practicality] it is clear that our current knowledge tells us more about barriers to implementing routine outcome measures than about the necessary and sufficient ingredients for their successful translation into clinically meaningful everyday use.”

We believe that the CORE-OM and the CORE system provide a strong platform from which to amend their first assessment to ‘yes’, ‘largely’ and ‘generally’ for psychological therapies, and we believe that the CORE-OM shows applicability, acceptability and practicality. However, we agree completely that much cultural change, in which practice-based evidence (Margison et al, 2000) must be given equal respect with evidence-based practice, will be needed for “successful translation into clinically meaningful everyday use”.

Clinical Implications

  • The Clinical Outcomes in Routine Evaluation — Outcome Measure (CORE-OM) can be hand-scored for immediate use or scanned by computer to facilitate audit of large clinical populations as part of clinical governance.

  • The CORE-OM is a reliable and valid instrument for clinical audit at the case, therapist or unit level.

  • The CORE-OM provides a ‘lowest common denominator’ or ‘common currency’ for assessing clinical effectiveness across models of therapy, complementing theory-specific case formulation and detailed, problem- or resource-specific measures.

Limitations

  • The current referential data-set needs to be extended to represent populations fully.

  • More data on test—retest stability in different populations and over different time intervals are needed.

  • More data from the instrument in formal efficacy studies are needed to strengthen its use as a bridge between efficacy and routine evaluation.

  • Received July 13, 2000.
  • Revision received June 26, 2001.
  • Accepted September 27, 2001.

