Expert clinical judgement combines technical proficiency with humanistic qualities.


To test the psychometric properties of questionnaires to assess the humanistic qualities of working with colleagues and relating to patients using multisource feedback.


Analysis of self-ratings by 347 consultant psychiatrists and ratings by 4422 colleagues and 6657 patients.


Mean effectiveness as rated by self, colleagues and patients was 4.6, 5.0 and 5.2 respectively (where 1=very low and 6=excellent). The instruments are internally consistent (Cronbach’s alpha >0.95). Principal components analysis of the colleague questionnaire yielded seven factors that explain 70.2% of the variance and accord with the domain structure. Colleague and patient ratings correlate with one another (r=0.39, P<0.001) but not with the self-rating. Ratings from 13 colleagues and 25 patients are required to achieve a generalisability coefficient (Eρ²) of 0.75.


Reliable 360-degree assessment of humane judgement is feasible for psychiatrists who work in large multiprofessional teams and who have large case-loads.

Expert clinical judgement combines technical proficiency, which is derived from scientific understanding, with humanistic qualities. The latter are of particular importance in psychiatry where interprofessional teamwork and a good working relationship with the patient are both essential to high-quality care. Techniques to measure the technical competency of doctors are better developed than those to measure the non-technical aspects of clinical practice. One reason is that important aspects of technical proficiency are observable as behaviours and therefore are measurable using, for example, clinical audit and workplace-based assessment of competencies and skills. Our intention has been to develop and test an approach to assessing the qualities that constitute good humane judgement by senior psychiatrists (consultants). This paper describes the development of a 360-degree assessment system and the analysis of the results for the first 347 psychiatrists to participate.


Development of the questionnaires

Research ethics approval was obtained for all stages of the work to develop the questionnaires.

Theoretical background

Two of the seven core domains defined by the UK’s General Medical Council (GMC) as being central to good medical practice encapsulate the humane qualities that are required by doctors.1 The domain ‘relationships with patients’ includes good communication, being open and honest, and the quality of the doctor’s relationships with the relatives, carers and partners of patients. The domain ‘working with colleagues’ includes effective teamworking, respect for colleagues and appropriate sharing of information.

We set out to express these desirable interpersonal qualities as observable behaviours and then to measure them using multisource feedback from the colleagues and patients affected by the consultant’s work performance. This type of ‘360-degree assessment’ has been applied to physicians and surgeons.2–6

The stated purpose of assessment may affect the ratings given by an appraiser.7 We therefore informed all involved in developing and testing the questionnaires that their purpose was to give constructive feedback to the consultant as part of a developmental, ‘formative’ process with an emphasis on personal and professional development; for example, as one component of appraisal. It was not intended that the results be used as a summative judgement, for example about fitness to practise.

Identification of items and initial testing

We used the critical incident technique8,9 in interviews with 24 specialist mental healthcare workers (four consultant and six non-consultant grade psychiatrists, five nurses, three other clinical staff and six managers). We asked interviewees to think of the last time they had seen a consultant do something that was particularly effective and something that was particularly ineffective in relating to patients or working with colleagues. For each event, the interviewee was asked: what was the situation; what events led up to it; what exactly happened; why was it effective/ineffective; and what was the outcome?

Two raters independently undertook a thematic analysis of interview transcripts and then met to generate the first index of behavioural competency items. This was presented to a focus group comprising four psychiatrists, two nurses, one other practitioner, three managers and two medical secretaries. Participants ranked and weighted the behavioural competencies and used the repertory grid technique10 to draw on their personal experience to relate the competencies to the expected performance of: an expert psychiatrist; a novice psychiatrist; an experienced psychiatrist but one who would not be considered an expert; the worst psychiatrist with whom they have worked; an expert, nonpsychiatric doctor; and a trainee psychiatrist.

The project team edited the items to ensure that each described a different behaviour and that each was worded so that its meaning was unambiguous. The items were grouped into domains to create the first draft questionnaires – one for self-rating and one for completion by colleagues. Minor revisions were made following an initial test of face validity and feasibility by eight consultants and their colleagues. Finally, a subset of 17 items was selected for a questionnaire that could be completed by patients.


Fifty-one consultant psychiatrists participated in the pilot. They had volunteered in response to a letter of invitation sent to members of the Faculty of General and Community Psychiatry of the Royal College of Psychiatrists. The 51 consultants, 609 of their colleagues and 937 of their patients completed questionnaires. The consultants and the colleagues were also asked to rate the importance of each item on a scale of 1=unimportant to 6=essential.

The mean rating of importance for the 59 items was 5.1 for the self-rating questionnaire and 5.3 for the colleague version. The lowest mean rating of importance for any individual item was 4.4; a score of 4 on the importance scale equates to ‘important’. In both the self-rating and colleague questionnaires, two pairs of items had an inter-item correlation above 0.8. One item of each pair was dropped, leaving the same 57 items in the final versions of the self-rating and colleague questionnaires.
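The redundancy check described above can be sketched in a few lines of Python. This is only an illustration, not the authors' own procedure: the function names and the toy ratings are hypothetical, and Pearson's coefficient is assumed as the measure of inter-item correlation.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def redundant_pairs(items, threshold=0.8):
    """Return pairs of item names whose ratings correlate above the threshold.

    `items` maps an item name to its list of ratings, one per respondent.
    """
    return [
        (a, b)
        for a, b in combinations(items, 2)
        if pearson_r(items[a], items[b]) > threshold
    ]


# Hypothetical ratings on the 1-6 scale from six respondents.
toy_items = {
    "item_a": [4, 5, 6, 3, 5, 4],
    "item_b": [4, 5, 6, 3, 5, 5],  # nearly duplicates item_a
    "item_c": [5, 3, 4, 5, 6, 2],
}
flagged = redundant_pairs(toy_items)  # candidate pairs for dropping one item
```

Which item of a flagged pair to drop remains a matter of judgement about wording and content; the sketch only identifies the candidates.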

The final questionnaires

The 57 items in the self-rating and colleague questionnaires cover nine domains:

  1. communication (six items)

  2. availability (four items)

  3. emotional intelligence (eight items)

  4. decision-making (seven items)

  5. relationships with patients (nine items)

  6. relationships with patients’ relatives, partners and carers (five items)

  7. relationships with consultant colleagues (four items)

  8. relationships with junior doctors (six items)

  9. relationships with the team and external agencies (eight items).

Each item describes briefly a behaviour that relates to the domain in question. For example, items in the emotional intelligence domain include ‘offering reassurance when appropriate’, ‘being consistently respectful of others’ and ‘being willing to take advice from others’.

The patient questionnaire contains 17 items. The behaviours that these describe have their counterparts in items in the first six domains of the self-rating and colleague questionnaires.

Each of the three questionnaires asks the person completing it to rate the effectiveness of the subject on each item by scoring observed behaviours (from 1=very low to 6=excellent). The instructions for the colleague questionnaire state that the purpose of the assessment is so that the consultant can ‘learn about themselves and develop’. They ask each person who completes the questionnaire to be honest, to ensure that one aspect of the consultant’s personality does not influence all ratings, to use current behaviour as the basis for ratings and to be neither overly favourable nor overly critical. The patient questionnaire states that the purpose of assessment is so that the consultant ‘can improve the quality of care he or she provides’. Both questionnaires state that a consultant will not be able to identify the rating made by any individual responder.

The three questionnaires that resulted from the process of development and initial testing then became the core of a new 360-degree assessment service for consultant psychiatrists (ACP 360) that is offered by the Royal College of Psychiatrists. The full questionnaires, and their accompanying instructions, can be viewed at www.rcpsych.ac.uk/crtu/centreforqualityimprovement/acp360.aspx.


The ACP 360 was launched by the Royal College of Psychiatrists in Autumn 2005. A recruitment letter was sent to all consultant psychiatrists who work with individuals of working age in the UK. There was a charge for participation. We present here the results of an analysis of the returns from the first 347 participants.

The assessment process

Each consultant was sent a pack, which included written instructions. They were asked to complete a rating questionnaire about themselves, either in paper form or online, and to select at least 15 colleagues and 30 patients to act as appraisers. It was recommended that the selected colleagues included one line manager (clinical/medical director or chief executive), four consultant psychiatrists, seven other clinical colleagues (such as team manager, nurses, social workers, occupational therapists, psychologists, junior doctors) and three non-clinical colleagues (such as secretaries, clerks or administrators). Consultants were asked to select patients with whom they had had significant and recent contact and for their selection to reflect their case-load in terms of gender, age, ethnicity and diagnosis.

Each consultant requested each of their colleagues to complete a questionnaire, either online or in paper form, and each consultant posted a paper questionnaire to the 30 patients together with a standard letter. Colleagues and patients sealed paper questionnaires in a pre-addressed envelope and returned them to the Royal College of Psychiatrists. The system is designed so that consultants do not know which particular persons have returned a questionnaire and they never see any completed questionnaires. The instructions explained this arrangement to colleague and patient raters. Ratings from a minimum of 10 colleagues and 10 patients were required before the returns were analysed. If this threshold had not been reached within a set period, the consultant concerned was asked to send a reminder letter to all 15 colleagues and 30 patients.

Data analysis

We examined questionnaires and ratings from three perspectives (self, colleague and patient) using descriptive statistics. We tested internal consistency using Cronbach’s alpha. We used principal components analysis with varimax rotation to study the structure of the colleague and patient questionnaires. We examined interrater reliability of both the colleague and patient questionnaires using intraclass correlations and used the generalisability coefficient,11,12 denoted as Eρ², to estimate the number of colleague and patient ratings required to undertake a meaningful assessment of a consultant using this approach. Finally, we examined the relationships between the three measurement perspectives using Pearson correlation coefficients. All analyses were conducted using SPSS version 14.0.


Response rates

A total of 4422 colleagues (a mean of 12.7 per consultant; range 10–17) and 6657 patients (a mean of 19.2 per consultant; range 10–31) rated the 347 consultants.

Missing data

A mean of 2.0% of data items were missing for the self-rating questionnaire; no single item was left unrated by more than 33 consultants (9.5% of the total). As regards the 57 items of the colleague questionnaire, 28 items were left unrated by fewer than 5% of colleagues, 20 by between 5% and 10%, and 9 by more than 10%. All 9 items with more than 10% of missing data were in the domains concerning relationships with consultant peers and with junior doctors. The percentage of missing data ranged from 10.4% to 21.1% for items in these two domains. Overall, 13 of the 17 items in the patient questionnaire had fewer than 5% missing data. The exceptions were 3 items relating to carers and family members (range of missing data, 12.8% to 19.4%) and an item asking whether the consultant remained calm under pressure (6.2% missing data). In subsequent analyses, missing data were replaced by mean imputation when necessary.
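As an illustration of mean imputation, the sketch below fills each missing rating with the mean of the observed ratings for the same item. The paper does not say whether the mean was taken per item or per rater, so the per-item choice is an assumption, as are the function name and the toy data.

```python
def impute_item_means(responses):
    """Replace missing ratings (None) with the mean of the observed ratings
    for the same item. `responses` is a list of rows, one row per rater."""
    n_items = len(responses[0])
    item_means = []
    for j in range(n_items):
        observed = [row[j] for row in responses if row[j] is not None]
        item_means.append(sum(observed) / len(observed))
    return [
        [item_means[j] if value is None else value for j, value in enumerate(row)]
        for row in responses
    ]


# Hypothetical ratings from three raters on three items; None marks a gap.
toy_responses = [
    [5, 6, None],
    [4, None, 5],
    [6, 5, 4],
]
completed = impute_item_means(toy_responses)
```

Mean imputation keeps every rater in the analysis but reduces the variance of the imputed items, which is worth bearing in mind for items with 10% or more missing data.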

Ratings of effectiveness

The mean effectiveness ratings for all consultants for all items were 4.6 for the self-rating (s.d.=0.9), 5.0 for the colleague rating (s.d.=0.9) and 5.2 for the patient rating (s.d.=1.0). Overall, both colleagues (paired t-test: t=12.1, P<0.001) and patients (paired t-test: t=18.0, P<0.001) gave significantly higher ratings to the consultants than the consultants gave to themselves. Mean ratings of effectiveness were high (above 4.0) for all domains for both self- and colleague ratings (Table 1).
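The paired comparison reported above can be sketched as follows. The paired t statistic is the mean of the within-consultant differences divided by its standard error; the function name and the five pairs of ratings are hypothetical and are there only to show the calculation.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Paired t statistic for second minus first, with n - 1 degrees of freedom."""
    diffs = [b - a for a, b in zip(first, second)]
    standard_error = stdev(diffs) / sqrt(len(diffs))
    return mean(diffs) / standard_error


# Hypothetical global ratings for five consultants: each consultant's own
# rating paired with the mean rating given by that consultant's colleagues.
self_ratings = [4.0, 4.5, 5.0, 4.2, 4.8]
colleague_means = [5.0, 5.1, 5.2, 4.9, 5.3]
t_stat = paired_t(self_ratings, colleague_means)  # positive: colleagues rate higher
```

Pairing matters here because each consultant serves as their own control: the test asks whether the difference within consultants is consistently above zero, not whether the two groups of ratings differ on average.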

Table 1

Mean domain scores for the self-ratings (n=347) and ratings by colleagues (n=4422)

Internal consistency and structure of the questionnaires

Cronbach’s alpha for the self-, colleague- and patient-rated questionnaires were 0.98, 0.98 and 0.97 respectively. It is considered that a coefficient above 0.8 indicates adequate internal consistency and reliability.
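For readers unfamiliar with the statistic, Cronbach's alpha for k items is k/(k−1) multiplied by one minus the ratio of the summed item variances to the variance of the total score. The sketch below applies that standard formula to hypothetical data; it is not the SPSS routine used in the study, and the function name is illustrative.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows, one row of item ratings per rater."""
    k = len(responses[0])
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings from three raters on two items (1-6 scale).
toy_responses = [
    [4, 4],
    [5, 6],
    [6, 5],
]
alpha = cronbach_alpha(toy_responses)
```

Alpha tends to rise with the number of items, so values as high as 0.97 to 0.98 on questionnaires of 17 and 57 items may reflect some overlap between items as well as consistency.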

The principal components analysis, with varimax rotation, of the colleague ratings yielded seven factors with an eigenvalue greater than 1. These seven factors accounted for 70.2% of the total variance in the data. These factors were highly consistent with the structure of the questionnaire, as defined by items having a factor loading greater than 0.4 (Table 2). Five of the factors contained all the items in the corresponding domain. The principal components analysis of the patient ratings yielded a single factor that explained 66.8% of the total variance. This factor included all 17 items.
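The retention rule used above (keep components whose eigenvalue exceeds 1) and the resulting share of variance can be illustrated with a short sketch. When the analysis is run on a correlation matrix, the eigenvalues sum to the number of items, so the share explained is the sum of the retained eigenvalues divided by that total. The eigenvalues below are hypothetical and describe a six-item toy questionnaire, not the ACP 360 data.

```python
def retained_components(eigenvalues, cutoff=1.0):
    """Apply the eigenvalue-greater-than-cutoff rule.

    Returns the number of components kept and the proportion of total
    variance they explain.
    """
    total = sum(eigenvalues)
    kept = [value for value in eigenvalues if value > cutoff]
    return len(kept), sum(kept) / total


# Hypothetical eigenvalues for a six-item questionnaire (they sum to 6).
toy_eigenvalues = [2.7, 1.5, 0.8, 0.5, 0.3, 0.2]
n_kept, share_explained = retained_components(toy_eigenvalues)
```

In this toy case two components are kept and together they explain about 70% of the variance. Varimax rotation redistributes variance among the retained components but does not change their combined share.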

Table 2

The seven factors derived from the principal components analysis of ratings by colleagues and how these map to the questionnaire domains

Interrater reliability and generalisability

The intraclass correlation coefficient for the colleague questionnaire was 0.75. Ratings from 10 colleagues are required for an Eρ²≥0.70, and from 13 colleagues for an Eρ²≥0.75. Overall, 198 consultants (57%) achieved returns from 13 or more colleagues. For the patient questionnaire, the intraclass correlation coefficient was 0.70. A total of 19 patient ratings are required for an Eρ²≥0.70 and 25 for an Eρ²≥0.75. Returns from 19 or more patients were achieved by 183 consultants (53%) and from 25 or more by 57 consultants (16%).
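The link between a reliability coefficient and the number of raters required can be sketched as follows. The paper does not give the formula it used, so two assumptions are made here: that the coefficient for the mean of n raters follows the Spearman-Brown form Eρ² = nρ₁/(1 + (n − 1)ρ₁), where ρ₁ is the reliability of a single rating, and that the reported coefficient of 0.75 applies to the mean of 12.7 colleague ratings per consultant. The function names are hypothetical.

```python
from math import ceil


def single_rater_reliability(coefficient, n_raters):
    """Invert the Spearman-Brown form to recover the reliability of one rating."""
    return coefficient / (n_raters - (n_raters - 1) * coefficient)


def g_coefficient(rho_1, n_raters):
    """Coefficient for the mean of n_raters ratings (assumed Spearman-Brown form)."""
    return n_raters * rho_1 / (1 + (n_raters - 1) * rho_1)


def raters_needed(rho_1, target):
    """Smallest whole number of raters whose mean rating reaches the target."""
    return ceil(target * (1 - rho_1) / (rho_1 * (1 - target)))


# Assumption: 0.75 is the coefficient for the mean of 12.7 colleague ratings.
rho_colleague = single_rater_reliability(0.75, 12.7)  # roughly 0.19
needed_for_070 = raters_needed(rho_colleague, 0.70)
needed_for_075 = raters_needed(rho_colleague, 0.75)
```

Under these assumptions the sketch gives 10 and 13 colleagues, in line with the figures reported above. Because the published coefficients are rounded, the same back-calculation will not necessarily reproduce every reported figure exactly.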

Relationship between the different ratings

The global self-rating, expressed as a mean of all items, was not significantly correlated with either the global colleague rating (r=0.06, P=0.29) or the global patient rating (r=0.01, P=0.82). The correlation between the global colleague rating and the global patient rating was significant (r=0.39, P<0.001). Table 3 shows the correlations between the equivalent domain scores for self- and colleague ratings, and between the patient ratings and the domain scores for both the self- and colleague ratings.
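The significance of a Pearson correlation can be checked with the standard conversion t = r√((n − 2)/(1 − r²)), which has n − 2 degrees of freedom. The sketch below applies it to the rounded coefficients reported above, assuming that all 347 consultants contributed to each correlation (the paper does not state the n for each test); the function name is hypothetical.

```python
from math import sqrt


def t_from_r(r, n):
    """t statistic for testing a Pearson correlation against zero (df = n - 2)."""
    return r * sqrt((n - 2) / (1 - r ** 2))


# Assumption: n = 347 consultants for every correlation.
t_colleague_patient = t_from_r(0.39, 347)  # about 7.87
t_self_colleague = t_from_r(0.06, 347)     # about 1.12
t_self_patient = t_from_r(0.01, 347)       # about 0.19
```

With 345 degrees of freedom the two-sided 5% critical value is roughly 1.97, so only the colleague-patient correlation clears it, which agrees with the pattern of P values reported. Exact P values cannot be recovered from coefficients rounded to two decimal places.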

Table 3

Pearson correlation coefficients between the self- and colleague, self- and patient, and patient and colleague ratings


The properties of ACP 360 and its limitations

Consistent with multisource feedback tools developed for other specialty groups, ACP 360 ratings from all three groups of raters are skewed to the positive end of the scale and colleagues tend to rate consultants higher than consultants rate themselves.5,6,13

With ACP 360, the appraisees decide which colleagues and patients will assess them. It is possible that the high scores are at least in part because of the exclusion of colleagues and patients that consultants believe would rate them poorly. However, there is some evidence that colleague ratings obtained by multisource feedback are the same whether colleagues are selected at random to participate or whether the doctor concerned makes the selection.2 Also, a consultant working in a mental health team would need to involve a substantial proportion of colleagues to achieve the required number of ratings. We could find no study that examined the effect of method of selection on the ratings made by patients. It would be difficult to exclude all selection bias for patient raters because it is the psychiatrist concerned who is best placed to decide which patients should be asked. For example, some patients may lack the capacity to complete a questionnaire or might be distressed by being asked.

A weakness of both the colleague and patient questionnaires is that a substantial number of colleagues and patients are unable to rate some items. In particular, some colleagues are unable to assess the consultant’s relationship with other doctors. This might be because, in the UK, some psychiatrists work in partial isolation from other consultants. Also, colleagues from other disciplines might have little opportunity to observe the working relationship between consultants and junior doctors. Some patients are unable to rate items concerning the relationship between the consultant and the patient’s carers and family members. Perhaps carers and family members should be included as a separate, fourth group of raters in the 360-degree assessment process.

Three independent sources provide evidence of face and content validity for the instruments. First, the items were selected to cover certain components of performance described by the GMC as central to good medical practice1 and as extended for psychiatrists by the Royal College of Psychiatrists.14 Second, the development process involved three phases of research and consultation that involved both psychiatrists and those colleagues who would be asked to rate the psychiatrists. Third, those people who participated in the second, full-scale pilot rated all items as important, in both the self- and the colleague questionnaires. The high rate of return from both colleagues and patients suggests that raters did not find the questionnaires over-burdensome, although the respondents were probably highly motivated because they or their service had paid for them to participate. The disaggregation of complex humane attributes, such as ‘availability’, into a number of items that describe actual, observable behaviours might have made the questionnaires easier to use.

The three questionnaires show good internal consistency. Also, the colleague questionnaire has a factor structure that adheres to the domains covered by the items. Furthermore, both the domains that emerged from the qualitative work of developing the questionnaires, and the factors that were derived from the principal components analysis of the colleague questionnaire, are meaningful and mutually supporting. The patient questionnaire has a different structure to the colleague questionnaire. Its 17 items, which correspond to items drawn from six of the domains of the colleague questionnaire, form a single factor.

As with other 360-degree assessment instruments, which have been criticised for their low interrater reliability,15 the intraclass correlations were modest for both the colleague and patient questionnaires. It has been proposed that an Eρ² of 0.75 is the minimum requirement for generalisability for instruments used in multisource feedback.15 Using this benchmark, only 57% and 16% of participating consultants achieved sufficient returns for the colleague and patient questionnaires respectively. As a result of this analysis we have increased the target number of returns for consultants participating in ACP 360 to 13 colleague and 25 patient questionnaires. This number of colleague ratings can be achieved by consultants who, as in the UK, work as part of a large, multiprofessional team with a team of administrative support staff that work closely with the consultant. Some consultants participating in ACP 360 extend their colleague cohort by including general practitioners who take on shared care responsibilities. The ACP 360 would not be suitable for psychiatrists who work in isolation and/or have small case-loads of patients.

A recent systematic review of studies that compared physicians’ self-rated assessments with those of external observers concluded that ‘the preponderance of evidence suggests that physicians have a limited ability to self-assess’.16 The lack of correlation between self-ratings using ACP 360 and those made by colleagues is consistent with this conclusion. The authors of the review argue that this feature of self-assessment justifies the introduction of multisource feedback, ‘particularly when interpersonal skills, communication skills or professionalism need to be evaluated’.16

In common with other multisource feedback instruments, we have not formally tested the validity of the ACP 360 questionnaires.15 There is no gold standard test with which to compare the results and in the UK there is no effective system for grading the performance of doctors other than at the extremes where doctors are subject to disciplinary procedures by their employers or have sanctions imposed by the GMC. However, colleagues and patients are two independent groups of raters whose knowledge of consultants is derived from quite different perspectives. Arguably, the fact that their ratings are significantly correlated is a measure of concurrent validity.

The role of 360-degree assessment in measuring the performance of doctors

Consultants from all medical specialties, who work in the four UK National Health Services, are subject to annual appraisal of their roles, work and performance. In the UK, structured feedback on performance is also used in other ways. It is an important element of the annual review of each consultant’s job plan that is part of their contract of employment. It will play an important role in the new systems being introduced for organising and assessing medical training at all levels17,18 and will be a core part of the new procedures for revalidation.19

The Chief Medical Officer for England has stated that ‘while patients want their doctors to have good clinical knowledge and technical skills, they also rate the interpersonal aspects of care as equally, if not more, important’.20 Although critics have questioned whether humanistic qualities can be measured validly,15,20 360-degree assessment, which will be a requirement for relicensing by the GMC,19 is the assessment approach most likely to give meaningful results.

Psychiatrists participate in the Physician Achievement Review (PAR) Program that is managed by the College of Physicians and Surgeons of Alberta in Canada22 and is probably the longest established 360-degree assessment system for doctors. However, the PAR questionnaires that are applied to psychiatrists have been adapted from questionnaires designed for general physicians. The authors are not aware of any reports of 360-degree assessment systems developed specifically for psychiatrists or whose focus has been specifically on the humanistic aspects of psychiatric practice.

Implementation of ACP 360

The Royal College of Psychiatrists has introduced ACP 360 as a service for its members in the UK (www.rcpsych.ac.uk/crtu/centreforqualityimprovement/acp360.aspx). Participation is voluntary and the system is funded by a subscription fee that is paid either by the consultant or by their employing organisation. The patient questionnaire has been adapted for assessing psychiatrists who work with children and adolescents and for psychiatrists who work with people with an intellectual disability and with older people. We are also exploring the use of the system for assessing non-consultant grade psychiatrists.

The ACP 360 has been designed, tested and implemented for formative purposes. It is not intended as a summative review but to inform consultants about how they might focus their personal development plans in order to continue to develop their performance as a component of striving towards excellence. Therefore, its purpose is to give feedback to individual consultants and the results are provided confidentially to each consultant who takes part. However, to provide a comparator, the ratings for each participant are also presented in the context of a mean ‘benchmark’ rating of all consultants who have participated previously.


The Wales Office for Research and Development funded the qualitative development work. Gwent Healthcare NHS Trust funded the first pilot, which was undertaken by the Welsh Institute for Health and Social Care in the University of Glamorgan and the National Leadership and Innovation Agency for Healthcare in Wales. The Royal College of Psychiatrists funded the second, full-scale pilot that was undertaken by the Royal College of Psychiatrists’ Research and Training Unit.

  • Received June 18, 2007.
  • Revision received December 9, 2007.
  • Accepted December 14, 2007.

