There is concern over the methods used to evaluate antipsychotic drugs.


To assess the clinical relevance of findings in the literature.


A systematic review identified studies of antipsychotics that used the Brief Psychiatric Rating Scale (BPRS) and Positive and Negative Syndrome Scale (PANSS). A published method of translating these into Clinical Global Impression – Change scale (CGI–C) scores was used to measure clinical relevance.


In total 98 data-sets were included in the BPRS analysis and 202 data-sets in the PANSS analysis. When aggregated scores were translated into notional CGI–C scores, most drugs reached ‘minimal improvement’ on the BPRS, but few reached that level for PANSS. This was true of both first- and second-generation drugs, including clozapine. Amisulpride and olanzapine had better than average CGI–C scores.


Our findings show improvements of limited clinical relevance. The CGI–C scores were better for the BPRS than for the PANSS.

The introduction of a second generation of antipsychotic drugs has created a large literature concerning their therapeutic value and tolerability. Recently there has been concern over the reliability of this literature. The effects of pharmaceutical industry funding, publication bias, the appropriateness of first-generation comparators (particularly haloperidol) and a range of other methodological issues have come under close scrutiny.1–4 Two large, independent and highly credible studies, Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE)5 and Cost Utility of the Latest Antipsychotic Drugs in Schizophrenia Study (CutLASS),6 have cast serious doubt on the idea that second-generation antipsychotics are necessarily superior to older drugs in terms of either effectiveness or tolerability. The CATIE findings suggest that second-generation antipsychotics may not be similar to each other in effectiveness. Leucht and colleagues have produced a series of studies challenging the appropriateness of the outcome measures commonly used in antipsychotic trials. Their studies examine thresholds for reduction in Positive and Negative Syndrome Scale (PANSS)7, 8 and Brief Psychiatric Rating Scale (BPRS)9 scores that are generally taken to indicate response or remission. They have demonstrated that changes in PANSS and BPRS scores can be translated into notional Clinical Global Impression scale severity and change (CGI–S and CGI–C) scores. These scales quantify clinicians’ overall impression of clinical severity and clinical change in individuals’ psychiatric condition, and are intended to introduce clinical meaning into drug trial findings. Leucht and colleagues have suggested that commonly accepted thresholds for determining a response (for example, a 20% reduction in PANSS score) may generate statistically significant findings even though the corresponding clinical improvement is small.
Their method has some limitations, but it appears to generate an acceptably robust translation. This creates the opportunity to make an objective and reliable estimate of the clinical relevance of published findings.

Our study uses a systematic review methodology to assess the clinical relevance of the findings of the literature as a whole, by employing the method of Leucht and colleagues to translate PANSS and BPRS scores into CGI–S and CGI–C scores (Fig. 1). We do not seek to draw conclusions regarding the utility of antipsychotic drugs in clinical practice. We are interested in the clinical relevance of the findings in the literature, because this is the main body of scientific evidence that guides clinical prescribing practice and it is commonly used by the pharmaceutical industry to support its claims regarding the usefulness of its products.

Fig. 1

Conversion of Positive and Negative Syndrome Scale (PANSS) and Brief Psychiatric Rating Scale (BPRS) scores to Clinical Global Impression – Change (CGI–C) scores, adapted from Leucht et al.8, 9


In order to systematically review the literature we drew upon Davis and colleagues’10 systematic review of second-generation antipsychotic drug evaluation studies published up to (and including) 2001. We included in our search all studies that they had listed as references. We also conducted an electronic search for studies published between 2002 and 31 December 2007, using the same search terms. The inclusion criteria for our review were:

  1. participants: a diagnosis of schizophrenia or schizoaffective disorder;

  2. interventions: at least one second-generation antipsychotic drug;

  3. comparator: no comparator necessary;

  4. outcome measures: change in mean BPRS and/or PANSS score from baseline to end-point;

  5. antipsychotics: any antipsychotic that is or has been licensed;

  6. design: at least single-group pre–post design;

  7. reporting: published in a peer-reviewed journal and listed by Davis et al or in one of the three databases listed below (books and conference posters were excluded); published in English, German or French; sample size reported for each study arm; available as electronic or paper full text.

Davis et al applied the following search term to MEDLINE, PsycINFO and CINAHL to identify studies for their review: ((amisulpride OR aripiprazole OR clozapine OR olanzapine OR quetiapine fumarate OR remoxipride hydrochloride OR risperidone OR sertindole OR ziprasidone hydrochloride OR zotepine) AND (schizophrenia OR schizoaffective disorder) AND (BPRS OR PANSS)). We used this search term in our update.

We attempted to eliminate factors that would tend to systematically bias against positive findings for the index drug. We excluded studies that reported median scores in heterogeneous samples, as means usually show better results than medians in such samples; in other words, studies that reported results as medians were included only if they showed homogeneity with means. Studies were excluded if participants were essentially healthy at baseline, as this reduces the scope for clinical improvement. Studies were therefore included only if the baseline mean score was 58 or above on the PANSS, or 31 or above on the BPRS, indicating significant illness. Using Leucht et al’s method8, 9 this corresponds to a CGI–S score of at least 3 (mildly ill). We excluded all samples that reported the use of age-adapted subtherapeutic doses as defined by the manufacturers.
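The baseline severity criterion can be illustrated with a minimal sketch. The function and dictionary names are hypothetical and are not taken from the study; only the two cut-off values (58 for the PANSS, 31 for the BPRS) come from the text.

```python
# Baseline mean score cut-offs stated in the text (unadjusted scale scores).
BASELINE_THRESHOLDS = {"PANSS": 58.0, "BPRS": 31.0}


def meets_baseline_threshold(scale: str, baseline_mean: float) -> bool:
    """Return True if a data-set's baseline mean is at or above the cut-off.

    `scale` must be "PANSS" or "BPRS"; any other value raises KeyError.
    """
    return baseline_mean >= BASELINE_THRESHOLDS[scale]


# Hypothetical baseline means, for illustration only.
print(meets_baseline_threshold("PANSS", 91.5))  # included
print(meets_baseline_threshold("BPRS", 29.0))   # excluded: below 31
```

The comparison is inclusive because the text specifies "58 or above" and "31 or above".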

In a second step, we attempted to identify studies that used a single sample and data-set in multiple publications. As far as practically possible, we used each data-set only once and removed all repeat data.

In keeping with Leucht’s methodology, both PANSS and BPRS scores were adjusted to allow meaningful calculation of percentage change. The minimum possible score on PANSS (no symptoms) is 30, so this figure was subtracted to create an adjusted PANSS score. The corresponding score on BPRS is 18, which was similarly subtracted to produce an adjusted BPRS score. A mean, weighted for sample size, was calculated for aggregated BPRS and PANSS scores at baseline and follow-up. Mean change and percentage mean change were also calculated. These were calculated from adjusted scores (our figures show the unadjusted PANSS and BPRS scores as originally published). Percentage changes were calculated for: all active drugs combined; first- v. second-generation antipsychotic v. placebo; each individual drug.
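The adjustment and aggregation steps described above might be sketched as follows. This is an illustration under stated assumptions, not the authors' actual procedure: it assumes that percentage change is computed from the sample-size-weighted aggregated means (rather than by averaging per-study percentages), and the class and function names are hypothetical. The scale minima (30 and 18) are taken from the text.

```python
from dataclasses import dataclass

PANSS_MIN = 30.0  # minimum possible PANSS total (no symptoms), per the text
BPRS_MIN = 18.0   # minimum possible BPRS total, per the text


@dataclass
class DataSet:
    n: int           # sample size of the study arm
    baseline: float  # published (unadjusted) mean score at baseline
    endpoint: float  # published (unadjusted) mean score at follow-up


def weighted_mean(values, weights):
    """Mean of `values`, weighted by `weights` (here, sample sizes)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


def percentage_reduction(data_sets, scale_min):
    """Percentage reduction in the adjusted, sample-size-weighted mean score.

    Assumption: scores are adjusted by subtracting the scale minimum, then
    aggregated, and the percentage is taken on the aggregated means.
    """
    weights = [d.n for d in data_sets]
    base = weighted_mean([d.baseline - scale_min for d in data_sets], weights)
    end = weighted_mean([d.endpoint - scale_min for d in data_sets], weights)
    return 100.0 * (base - end) / base


# Hypothetical PANSS data-sets, for illustration only.
arms = [DataSet(n=100, baseline=90.0, endpoint=70.0),
        DataSet(n=300, baseline=80.0, endpoint=65.0)]
print(round(percentage_reduction(arms, PANSS_MIN), 1))
```

Subtracting the scale minimum matters: an unadjusted PANSS fall from 90 to 73.2 is about 19% of the raw score, but 28% of the adjusted score, because 30 of the baseline points could never have been lost.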

The independent sample t-test was used to determine homogeneity between median and mean BPRS scores in the comparison of first- and second-generation antipsychotics. The test showed that there was homogeneity (P>0.05). Consequently, the median and mean BPRS scores were combined for further analysis. There were no medians reported in the PANSS samples. We translated all aggregated PANSS and BPRS scores into notional approximate CGI–C scores, using the graphs published by Leucht and colleagues.

Data were extracted by two of the authors. In order to estimate intercoder reliability, each of them extracted data from 15 of the studies independently. The baseline mean PANSS or BPRS scores that they had derived from the studies were compared. Of the 15 values, 14 were identical (93% agreement, intraclass correlation 0.99).
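The percentage agreement figure is simple arithmetic, sketched here for clarity. The function name and the example values are hypothetical; only the 14-of-15 result comes from the text, and the intraclass correlation is not reproduced because the underlying values are not reported.

```python
def percent_agreement(coder_a, coder_b):
    """Percentage of paired values on which two coders agree exactly."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(1 for a, b in zip(coder_a, coder_b) if a == b)
    return 100.0 * matches / len(coder_a)


# Hypothetical baseline means from 15 studies; the coders differ on one value.
coder_1 = [92.1, 88.4, 75.0, 81.3, 95.6, 70.2, 84.9, 90.0,
           66.7, 78.8, 101.2, 83.5, 73.4, 89.9, 97.3]
coder_2 = list(coder_1)
coder_2[4] = 96.5  # single hypothetical discrepancy

print(round(percent_agreement(coder_1, coder_2)))  # 14 of 15 agree
```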

All data were expressed as a CGI–C equivalent change using the graphs produced by Leucht. We used the latest possible follow-up point in each study for the CGI–C translation in order to give the medication the longest possible time to have an effect, and thus avoid any bias against medication because of the short duration of follow-up. The figures are approximated with an accuracy margin of 0.05 as they derive from a non-linear graph. A CGI–C score of –1.0 is taken as the threshold for response to an antipsychotic drug; it equates to a percentage change of 28 in PANSS score or of 30 in BPRS score. The core aim of this study was to convert BPRS/PANSS scores into CGI–C equivalents and not to estimate the effect size for each drug. The available conversion tables8, 9 are based on percentage reduction from baseline and CGI–C scores, and so results below are reported in these terms rather than in terms of effect sizes.
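As a rough illustration of how the stated thresholds partition percentage reductions, the sketch below classifies a PANSS reduction using only the two cut-offs given in this article (28% for a CGI–C of –1.0, and 53% for 'much improved' as stated in the caption to Fig. 3). It is a coarse thresholding, not a reproduction of the non-linear conversion graphs, and the function names are hypothetical. For the BPRS only the 30% threshold is stated, so no further BPRS category is assumed.

```python
# Percentage reductions (adjusted scores) equating to CGI-C = -1.0, per the text.
MINIMAL_IMPROVEMENT_PCT = {"PANSS": 28.0, "BPRS": 30.0}
# PANSS reduction needed for 'much improved', per the Fig. 3 caption.
PANSS_MUCH_IMPROVED_PCT = 53.0


def reaches_minimal_improvement(scale: str, pct_reduction: float) -> bool:
    """True if the reduction reaches the 'minimal improvement' threshold."""
    return pct_reduction >= MINIMAL_IMPROVEMENT_PCT[scale]


def panss_category(pct_reduction: float) -> str:
    """Coarse CGI-C category for a PANSS percentage reduction."""
    if pct_reduction >= PANSS_MUCH_IMPROVED_PCT:
        return "much improved"
    if pct_reduction >= MINIMAL_IMPROVEMENT_PCT["PANSS"]:
        return "minimally improved"
    return "below minimal improvement"


print(panss_category(47.0))  # minimally improved
print(panss_category(25.0))  # below minimal improvement
```

The two example values match reductions discussed later in the article (47% and 25%), which the authors translate to CGI–C scores of –1.75 and –0.8 respectively.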

Data-loading reliability for each of the coders was assessed. All coded data from 10 studies for each coder were compared across paper and electronic recording sheets (49 data-points and 64 data-points respectively). Agreement across versions was 98% and 100% respectively.


The initial search process generated 678 titles (Fig. 2). Of these, 114 came from the reference list in Davis et al10 and 564 were identified in our 2002–2007 search. In a second stage we excluded those papers that did not fulfil our inclusion criteria or could not be obtained. We located 211 full-text articles from the 678 titles originally generated. Of these, 91 papers were then excluded for the following reasons: wrong data format (for example, scores reported graphically rather than numerically), no new data reported, insufficient data or information (for example, no sample size), baseline BPRS/PANSS score below threshold, and other reasons.

Fig. 2

Flow diagram of studies and data-sets included.

BPRS, Brief Psychiatric Rating Scale; PANSS, Positive and Negative Syndrome Scale.

The remaining 120 studies (online supplement) had between 1 and 7 study arms and yielded 300 separate data-sets relating to a second-generation antipsychotic. In total, 98 data-sets were included in the BPRS analysis (12 different drugs) and 202 data-sets were included in the PANSS analysis (15 different drugs). A total of 22 428 participants were included in the PANSS scores analysis, and a total of 9772 in the BPRS scores analysis. There were no statistically significant differences in baseline PANSS and BPRS score between the three main groups (first-, second-generation antipsychotics, placebo).

Table 1 shows the results of the analysis of PANSS scores including the total sample size at baseline, the number of data-sets included, mean absolute PANSS scores at baseline and post-treatment, aggregated mean change in scores in absolute numbers and as percentages, and notional CGI–C scores. Table 2 shows the corresponding findings for BPRS.

Table 1

Summary results Positive and Negative Syndrome Scale (PANSS): reported non-adjusted PANSS mean/median values

Table 2

Summary results Brief Psychiatric Rating Scale (BPRS): reported non-adjusted BPRS mean/median values

In the PANSS sample, the CGI–C score for most of the drugs clustered around –0.5 to –1.2 (0 is no change and –1 is the threshold for ‘minimal improvement’). The exceptions were chlorpromazine with a CGI–C score of –0.2, and amisulpride with a CGI–C score of –2.2 (much improved). In the BPRS sample, CGI–C scores were larger, the majority of drugs scoring between –1 and –1.9 (minimal improvement). Sertindole was below the –1 threshold, and chlorpromazine was far below it. Only olanzapine scored better than –2 (much improved). See Fig. 3 for a summary of the results.

Fig. 3

Summary of results of all oral drugs with at least two studies or more than 100 participants in total. Ranked by Positive and Negative Syndrome Scale (PANSS) percentage reduction (28% reduction needed for ‘minimally improved’, 53% for ‘much improved’).

BPRS, Brief Psychiatric Rating Scale; AMI, amisulpride; OLA, olanzapine; SGA, all second-generation antipsychotics combined; RIS, risperidone; CLO, clozapine; QUE, quetiapine; HAL, haloperidol; FGA, all first-generation antipsychotics combined; ARI, aripiprazole; ZIP, ziprasidone; CPZ, chlorpromazine.


Main findings

The published trial literature consistently reports improvement in participants with schizophrenia and other psychotic illnesses when they are given antipsychotic drugs. However, our study suggests that on average the clinical significance of the reported findings is disappointingly limited. There is little difference in this regard between first- and second-generation drugs as categories, although there are differences between individual drugs.

Broadly speaking, our findings are similar to those of other recent studies, both with regard to the similarities between first- and second-generation antipsychotics as categories (for example, CutLASS6) and with regard to differences between the different second-generation antipsychotics (for example, CATIE5 and two recent meta-analyses4, 10). In the reported studies, amisulpride and olanzapine appear, by and large, to lead to a greater overall clinical improvement than other drugs. Ziprasidone, quetiapine, sertindole, aripiprazole and chlorpromazine appear to perform relatively poorly. However, only the most effective drugs produce a moderate improvement on CGI–C when PANSS and BPRS changes are translated into notional CGI–C scores. No drug achieved this on both the PANSS and the BPRS. Interestingly, even clozapine does not perform well in our analysis. This stands in contrast to other studies, even though our sample included the CATIE results. However, in most, if not all, other studies clozapine was used for individuals with defined therapy-resistance. Therefore it is not surprising that the clinical effects observed were only moderate.

Comparison of first- and second-generation antipsychotics as categories shows neither class of drug achieving a mean CGI–C score of great clinical significance. Clinical Global Impression – Change scores derived from PANSS and BPRS scores show a similar pattern between drugs, but BPRS scores translate into greater clinical improvement than PANSS scores. This may be a result of the greater prominence of negative symptoms in the PANSS scale.

Drug trials are conducted using groups of participants that differ considerably from clinical populations with acute illnesses. Naturalistic studies of individuals under treatment for acute schizophrenia-spectrum disorder may show greater improvements in PANSS scores than those seen in drug trials. For example, Jäger et al11 conducted a naturalistic study of 280 in-patients with schizophrenia. There was a mean reduction in PANSS scores of 47% (equivalent to a CGI–C score of –1.75) between admission and discharge. The PANSS reduction was 25% (equivalent to a CGI–C of –0.8) in the subsample of individuals with a history of multiple admissions, a result much more akin to our findings. The PANSS reduction for the total sample was unusually large compared with the drug trial literature. The authors suggested that this was the result of the individuals being treated holistically in a naturalistic study, together with the inclusion of a larger proportion of individuals with first-episode psychosis than would be possible in a drug trial. This underlines the value of naturalistic studies, where conditions and, it seems, findings correspond more closely to clinical experience.

Other explanations for our findings include the possibility that the clinical relevance of antipsychotics is in fact limited. Alternatively, symptomatology may only be a small aspect in individuals’ overall assessment of their quality of life.12 In addition, our method of aggregating and averaging results may hide the larger improvements that certain individuals gain from antipsychotics. With regard to first- and second-generation antipsychotics as categories, poorly performing drugs may have masked the admittedly limited effects of better performing drugs. Other reasons or combinations of reasons may also apply.


Our study has some limitations. As is commonly the case in a systematic review, a large number of studies were excluded for a variety of reasons. Although we have not included the whole literature, it is reasonable to suppose that we have captured a very large proportion and included the most prominent and best conducted published studies. There is no reason to suppose that the exclusion of studies has created a bias to minimise the clinical effects of antipsychotic drugs. Leucht et al9 have pointed out that their method of converting PANSS and BPRS continuous scores into CGI categorical scores involves the translation of psychometrically validated instruments into impressionistic scales using conversion graphs that are not perfectly linear. We have attempted to partially overcome this by reporting CGI scores to two decimal places, to indicate proximity to threshold values on the CGI. We have not examined the length of studies as a factor affecting outcome, but the vast majority of studies were between 6 and 24 weeks long. There is, however, increasing evidence that much of the antipsychotic effect that can be measured with PANSS and BPRS scores occurs in the first 2 weeks of treatment.13 This suggests that it is unlikely that study duration accounts for our findings.

Both CATIE and CutLASS investigated effectiveness through a variety of different outcome criteria, whereas the majority of studies we included were designed to examine efficacy. However, our review included the results from CATIE and our findings are in keeping with other recent meta-analyses4 that included efficacy studies. We do not believe that the limitations of our study invalidate the principal findings.


We would caution against drawing the conclusion that we have shown that antipsychotic drugs have negligible effects in clinical practice. Our data have been drawn from a literature that has well-recognised faults, such as short length of drug exposure, low dosage regimes for comparator drugs, studies conducted on participants who are sometimes only mildly ill at baseline (which limits the potential for improvement), publication bias and so on. This is a study of the literature, and we have shown that the findings of that literature are of limited clinical significance. As a consequence, caution should be exercised when drawing conclusions from this literature about the clinical usefulness of these drugs.

Our findings lend considerable support to recent proposals that higher outcome thresholds should be applied to PANSS and BPRS scores in future studies.14–17 We also support the recent suggestion to rescale the PANSS in order to make comparisons such as ours easier to calculate.18 We would further suggest that measures of clinical relevance should be included alongside measures of changes in psychopathology in all drug trials.


We would like to thank Stefan Leucht for providing us with information about his conversion graphs. We would like to thank the PANSS Institute for their advice on cut-off points.

  • Received November 10, 2009.
  • Revision received September 15, 2010.
  • Accepted October 6, 2010.

