Background Use of bibliometric assessments of research quality is growing worldwide. So far, a narrow range of metrics have been applied across the whole of biomedical research. Without specific sets of metrics, appropriate to each sub-field of research, biased assessments of research excellence are possible.
Aims To discuss the measures used to evaluate the merits of psychiatric biomedical research, and to propose a new approach using a multidimensional selection of metrics appropriate to each particular field of medical research.
Method Three steps: (a) defining scientific ‘domains’, (b) translating these into ‘filters’ to identify publications from bibliometric databases, and (c) creating standardised measures of merit.
Results We propose using: (a) established metrics such as impact factors and citation indices, (b) new derived measures such as the ‘world scale’ score, and (c) new indicators based on journal peer esteem, impact on clinical practice, medical education and health policy.
Conclusions No single index or metric can be used as a fair rating to compare nations, universities, research groups, or individual investigators across biomedical science. Rather, we propose using a multidimensional profile composed of a carefully selected array of such metrics.
The aims of this paper are to discuss the measures used to evaluate the merits of psychiatric biomedical research, and to propose a new approach using a multidimensional selection of appropriate measures. Such measures can be used, for example, to inform decisions on the allocation of research funds (Lewison et al, 1998) within an institution or nationally, or on academic promotions. The choice of which indicators to use is critical. Different single or combined measures can produce entirely different results and implications. We propose that quantitative evaluations of scientific merit in a particular field of biomedical research need a fair set of relevant standardised indicators, chosen according to the type of research being evaluated. The set relevant for each sub-field of research can combine established bibliometric assessments (e.g. impact factors or citation indices); derived measures such as the ‘world scale’ score discussed below; and new indicators based on journal peer esteem, impact on clinical practice, medical education or health policy.
Our approach comprises three steps, beginning with a clear definition of scientific ‘domains’ with an agreed set of boundaries, which are then translated into ‘filters’ to identify relevant publications from bibliometric databases, finally leading to the creation of standardised measures of merit in biomedical research. A domain can be defined at three levels:
the major field, such as biomedical research;
the sub-field, such as psychiatric genetics;
the subject area of a research group, such as the genetics of Alzheimer’s disease.
Domain boundaries will not be universally accepted even if they start from agreed definitions (see below). It may be necessary to involve several experts in order to reach a consensus. Focusing at the second level (sub-fields), we illustrate each of these three steps using as examples the sub-fields of psychiatric genetics and health services research in mental health, which differ in representing respectively the more basic and the more applied ends of the research spectrum (Dawson, 1997; Horig & Pullman, 2004).
Defining the scientific domains of interest
We use the following definition of psychiatric genetics:
‘The role of genes in mental disorders, investigated through family linkage or association studies (sometimes involving polymorphisms in candidate genes). Effects may be observed through psychopathology, psychopharmacology, personality, cognitive function or behavioural variation’.
An understanding of health services research in relation to mental health also depends upon its definition. This has been given, for example, by the Medical Research Council in the UK (Medical Research Council, 2002), Academy Health in the USA (Academy Health, 2004) and the Health Services Research Hedges team (Wilczynski et al, 2004). From reviewing these we propose that health services research be defined as the multidisciplinary field of scientific investigation that:
describes healthcare needs, variations in access to services, and patterns and quality of healthcare provision;
evaluates the costs and outcomes of healthcare interventions for individuals to promote health, prevent or treat disease or improve rehabilitation;
determines ways of organising and delivering care for populations;
develops methods of disseminating evidence-based practice;
investigates the broader consequences of healthcare interventions, including acceptability, effects on carers and families, and the differential impact of interventions on subgroups of patients.
Health services research in mental health is therefore considered here as the application of this definition to mental disorders, their treatments and related services.
Developing filters to identify scientific publications in a specified domain
Filters have been developed over the past decade to identify, in bibliometric databases, publications that are relevant to the ‘cause, course, diagnosis or treatment’ of specific health problems (Haynes et al, 2005), to ‘aid clinicians, researchers and policy-makers harness high quality and relevant information’. For particular sub-fields, specialist journals have traditionally been used to identify relevant publications, but these need to be supplemented with title words, often in combination (Lewison, 1996), because for many biomedical sub-fields two-thirds or more of the papers will be in ‘general’ journals. Such a filter can achieve both a specificity (or precision) and a sensitivity (or recall) above 90%, and calibration methods have been described (Lewison, 1996).
In relation to our illustrative sub-fields, the filter developed for psychiatric genetics selected papers from the Science Citation Index if they were within both the sub-fields of genetics and mental health. Of the resulting scientific papers identified, 93% were relevant; this rose to 99% when those on ‘suicide genes’ (which are not related to mental health) were removed. The mental health services research filter was also based on the intersection of two separate filters (health services research and mental health), but that for health services research was much harder to define (Wilczynski et al, 2004). Although its specificity was as high as 0.93, the sensitivity only reached 0.59, as it proved difficult to list all the combinations of title words on many relevant papers. In principle, sensitivity can be improved by incorporating additional title words or journals taken from false-negative papers from relevant departments. However, this may be at the expense of specificity if too many terms are included in the filter.
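The intersection of two title-word filters, and its calibration against expert judgements, can be sketched as follows. This is a minimal illustration only: the term lists, the function names and the five-paper sample are all hypothetical, since the actual filter definitions are not reproduced here. It follows the usage above, in which specificity is equated with precision and sensitivity with recall.

```python
# Hypothetical title-word lists; the real filters are not given in the text.
GENETICS_TERMS = {"gene", "genes", "genetic", "polymorphism", "linkage"}
MENTAL_HEALTH_TERMS = {"schizophrenia", "depression", "bipolar", "suicide"}


def matches(title, terms):
    """Return True if any filter term appears as a whole word in the title."""
    words = set(title.lower().replace(",", " ").replace(":", " ").split())
    return bool(words & terms)


def sub_field_filter(title):
    """Select a paper only if it passes both component filters."""
    return matches(title, GENETICS_TERMS) and matches(title, MENTAL_HEALTH_TERMS)


def precision_and_recall(papers):
    """papers: list of (title, is_relevant) pairs judged by an expert.

    Precision is the share of selected papers that are relevant;
    recall is the share of relevant papers that are selected.
    """
    selected = [relevant for title, relevant in papers if sub_field_filter(title)]
    true_positives = sum(selected)
    total_relevant = sum(relevant for _, relevant in papers)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / total_relevant if total_relevant else 0.0
    return precision, recall


# Invented calibration sample, for illustration only.
SAMPLE = [
    ("Linkage analysis in bipolar disorder", True),
    ("Candidate gene polymorphism and schizophrenia", True),
    ("Twin study of heritability in autism", True),   # missed by the filter
    ("Suicide gene therapy for glioma", False),       # selected in error
    ("Crop yields under drought", False),
]

precision, recall = precision_and_recall(SAMPLE)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy sample the ‘suicide gene’ paper is a false positive and the twin study a false negative, mirroring the two kinds of error described above. Adding ‘heritability’ and ‘autism’ to the term lists would recover the missed paper, but every added term also risks admitting further irrelevant titles.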
Established measures of the merits of biomedical psychiatric research
Research evaluation is concerned both with the volume of output and its quality. Regarding volume, the number of identified research papers can be used at a global, national or institutional level to consider whether the amount of research is commensurate with the associated disease burden (Lewison, 2005). There may be an international imbalance, as was shown by the Global Forum for Health Research, for example for AIDS (de Francisco, 2004). At the national level, the correlation between numbers of papers and global burden of disease was found to be good for deaths from gastric cancer (r²=0.90; Lewison et al, 2001) but very poor for those from lung cancer (r²=0.04; Rippon et al, 2005). At the institutional level, publication counts can be compared with inputs of money and personnel.
In relation to the quality of publications, the central problem is that most evaluations of scientific merit are limited to the number of citations of papers by other papers (as recorded in the Science Citation Index or the Social Science Citation Index), or to analyses of journal impact factors (Tsafrir & Reis, 1990; Seglen, 1997). These may be more appropriate for the basic science sub-fields such as psychiatric genetics, where citations are numerous, but may give a distorted view of applied clinical sub-fields, and so may prejudice applications for competitive funding.
The ‘world scale’ for assessing domains of research
To complement these two established measures, we propose a new scale assessing the relative scientific merit of a country or of an institution. This new scale is derived from citation scores or from journal impact factors. It is based on the concept of World-scale, an idea borrowed from the oil tanker charter market, in which the output from an entity (a country, an institution or an individual) is compared with that of the world at different levels of excellence. One might ask, for example, what percentage of the output of such an entity receives a citation score sufficient to place it in the top 10% of the world production in that particular domain: if it is more than 10%, this indicates a superior performance, and if it is less, then it is not so meritorious on the selected criterion. Similar calculations can be made at other centiles (e.g. 5%, 20%) and an average value determined, or one weighted to reflect the greater importance of performing well at the top levels. A 5-year citation window is used here because it strikes a balance between immediacy and the need to allow time for the papers to be properly judged by the scientific community. World scale values could also be based on shorter (or longer) time windows.
Table 1 shows world scale values for UK and US papers in the selected sub-fields, using citation scores as the source data. For example, the USA has 25 papers from 706 in psychiatric genetics that received 112 or more citations in the given period, or 3.54% compared with the world norm of 2.03%, so its world scale value at the 2% centile was (3.54/2.03)×100=175. We can see that the USA has a superior performance over the whole range of centiles in both sub-fields. The UK is slightly better than average in psychiatric genetics, but below average (especially at the higher centiles) in mental health services research. This is probably because work in this area tends to be more specific to a country, with UK experience less relevant to US researchers, who published over 53% of all the papers (a reason for distrusting citation scores alone in such a domain).
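The world scale arithmetic can be sketched in a few lines of code. This is an illustrative sketch rather than the procedure used to build Table 1: the function and variable names are assumptions, and the optional weighting scheme is hypothetical, as the text does not specify one.

```python
def world_scale(entity_top_papers, entity_total_papers, world_share_percent):
    """World scale value at one centile.

    The entity's share of papers at or above the world citation threshold
    is divided by the world share at that threshold and multiplied by 100,
    so that 100 represents the world average.
    """
    entity_share_percent = 100.0 * entity_top_papers / entity_total_papers
    return 100.0 * entity_share_percent / world_share_percent


def mean_world_scale(values, weights=None):
    """Average world scale values across centiles.

    With no weights this is a plain mean; hypothetical weights can be
    supplied to give more importance to the top centiles.
    """
    if weights is None:
        weights = [1.0] * len(values)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# USA, psychiatric genetics: 25 of 706 papers reached the threshold that
# 2.03% of world papers reached.
usa_top_centile = world_scale(25, 706, 2.03)
print(round(usa_top_centile, 1))
```

Computed from the rounded figures quoted above, the result is roughly 174, which agrees with the reported value of 175 to within rounding of the inputs.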
World scale values can also be calculated from journal impact factors, and the US values are 157 for psychiatric genetics and 136 for mental health services research. In comparison, the UK psychiatric genetics score is 94 and the mental health services research score is 98, which reverses the trend seen above in world scale values based upon citation counts.
Relative esteem value
A further measure that can complement impact factors and citation indices in assessing scientific merit is the ‘relative esteem value’ of journals (Lewison, 2002; Jones et al, 2004). This is determined from written questionnaires to researchers in a sub-field, which invite them to rate journals on a scale from ‘excellent’ to ‘decidedly secondary’. For the more basic sub-fields there is a reasonable correlation with journal impact factors (about 0.6), but in more applied clinical specialties the correlation coefficient may drop to zero (Lewison, 2002), with some highly cited journals being of lower subjective esteem for communication of research results than some less frequently cited journals (Lee et al, 2002). Figure 1 shows the relationship between relative esteem value and impact factor for 29 leading journals in mental health services research.
World scale values can also be calculated using relative esteem values, as previously described (Lewison, 2004). Comparing the USA and UK for mental health services research using a world scale based upon such values, rated by 88 international researchers in the field, we found scores of 98 for the USA and 116 for the UK, again giving quite different results from the world scales from citation counts (144 for the USA and 81 for the UK). The relative standing of the scientific output from different countries (or institutions) in particular sub-fields can therefore be highly dependent upon the assessment measures used.
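The dependence of relative standing on the chosen measure can be made explicit by placing the world scale values quoted in the text for mental health services research side by side. The sketch below is illustrative only; the dictionary layout and the function name are assumptions.

```python
# World scale values for mental health services research, as quoted in the
# text (citation counts, journal impact factors, relative esteem values).
PROFILES = {
    "USA": {"citations": 144, "impact_factor": 136, "relative_esteem": 98},
    "UK": {"citations": 81, "impact_factor": 98, "relative_esteem": 116},
}


def leader_by_indicator(profiles):
    """Return, for each indicator, the entity with the highest value."""
    indicators = next(iter(profiles.values())).keys()
    return {
        indicator: max(profiles, key=lambda name: profiles[name][indicator])
        for indicator in indicators
    }


leaders = leader_by_indicator(PROFILES)
for indicator, country in leaders.items():
    print(f"{indicator}: {country}")
```

On these figures the USA leads on citations and impact factor, while the UK leads on relative esteem, so any ranking built on a single indicator would hide the reversal.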
Multidimensional assessment of scientific merit
No single measure alone can therefore provide a stable and rounded assessment of merit within a scientific sub-field. Rather, we propose that a range of indicators be used for a full appreciation of the value of research in any particular sub-field. These may include a combination of some of the following: impact factors; citation indices; world scale values; or relative esteem values, along with counts of patents that cite references within the sub-field (mainly for basic research); citations in clinical guidelines (mainly for clinical studies) (Grant et al, 2000); citations in journals actually read routinely by clinicians; citations in newspapers that are read by policy makers, healthcare professionals, researchers and the general public; citations in governmental and international policy documents (Lewison, 2004); citations in relevant international standards; citations in textbooks, which can indicate an impact on medical education (Lewison, 2004); and presence of researchers on journal editorial boards.
The comparative quality of research in a country, for example, can be shown in a multidimensional graphical display such as a kite diagram. Figure 2 shows such kite diagrams for the USA and the UK in the two sub-fields across eight indicators of scientific merit. These profiles use a further measure, the ‘relative commitment’ score: this measures the amount of effort a country devotes to a scientific sub-field, compared with its overall biomedical research portfolio, relative to the world average (world scale). For example, the UK publishes about 17% of world papers in mental health services research, whereas its presence in the world biomedical literature is only 10%, so its relative commitment is (17/10)×100=170 on the world scale index in this sub-field. Figure 2 therefore shows that the UK performs well in both sub-fields, but the next two indicators, moving clockwise, namely citations and journal impact, show that UK output has less impact than US publications in both sub-fields. Figure 2 also suggests that new indicators can be developed that quantify other important dimensions of research impact, such as informing clinical practice (Perneger, 2004), or enhanced patient safety (Agoritsas et al, 2005), for example as assessed through scientific paper citations in clinical guidelines or protocols. The other four indicators (patents, guidelines, and textbook and newspaper citations) are illustrated in Fig. 2 with dummy values, and would need to be determined in practice for such an evaluation to be complete. Such diagrams can also be used to compare institutions.
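The relative commitment score is a simple ratio of two publication shares, and can be sketched as follows. The function name and argument names are assumptions made for illustration.

```python
def relative_commitment(sub_field_share_percent, biomedical_share_percent):
    """Relative commitment on the world scale index.

    A country's share of world papers in a sub-field is divided by its
    share of all world biomedical papers and multiplied by 100, so that
    100 means the sub-field receives the same emphasis as in the world
    as a whole.
    """
    return 100.0 * sub_field_share_percent / biomedical_share_percent


# UK, mental health services research: about 17% of world papers in the
# sub-field against 10% of the world biomedical literature.
uk_commitment = relative_commitment(17, 10)
print(uk_commitment)
```

A value above 100, as here, indicates that the sub-field takes a larger share of the country's output than it does of world output; it says nothing by itself about the impact of that output, which is why it is shown alongside the other indicators.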
The range of assessment methods used in specific sub-fields of research should be appropriate to each case. In our examples, psychiatric genetics and mental health services research, it may be more important for the latter than for the former to influence health policy, treatment guidelines and clinical practice (Institute of Medicine, 2001). It is therefore reasonable to develop a range of such measures of health services research impact, but it would be unreasonable to apply all of them to assess psychiatric genetics. By the same token, if measures developed for the more basic biomedical sciences are used uncritically in the applied sciences, the latter may apparently perform poorly, and consequently suffer in terms of resource allocation (Dash et al, 2003).
However, more statistics do not mean better statistics. The key questions remain: who should make the choices between different measures of scientific merit; who should decide how these criteria are weighted; and with what agenda? We propose that the use of these measurements is best done within the context of the peer review process, as this is the strongest method so far devised to assure an overall appraisal of scientific merit. No single index or metric can be used as a fair rating to compare nations, universities, research groups or individual investigators between different sub-fields of science (Goldberg & Mann, 2006). Rather we propose that research oversight and peer review procedures refer not to any single measure of research quality (as is often the case at present), but refer simultaneously to a multidimensional profile, composed of a carefully selected array of such metrics (Martin, 1996), to construct a balanced and fair assessment of the merits of psychiatric research.
- Received March 30, 2006.
- Revision received August 25, 2006.
- Accepted October 27, 2006.
- © 2007 Royal College of Psychiatrists