Biometrics Research Unit, Columbia University, New York, New York
MedAvante, Inc., Madison, Wisconsin, USA
Correspondence: Janet B. W. Williams, MedAvante, Inc., 100 American Metro Blvd., Suite 106, Hamilton, NJ 08619, USA. Email: jwilliams{at}medavante.net
None. Funding detailed in Acknowledgements.
The Montgomery–Åsberg Depression Rating Scale (MADRS) is often used in clinical trials to select patients and to assess treatment efficacy. The scale was originally published without suggested questions for clinicians to use in gathering the information necessary to rate the items. Structured and semi-structured interview guides have been found to improve reliability with other scales.
Aims
To describe the development and test–retest reliability of a structured interview guide for the MADRS (SIGMA).
Method
A total of 162 test–retest interviews were conducted by 81 rater pairs. Each patient was interviewed twice, once by each rater conducting an independent interview.
Results
The intraclass correlation for total score between raters using the SIGMA was r=0.93, P<0.0001. All ten items had good to excellent interrater reliability.
Conclusions
Use of the SIGMA can result in high reliability of MADRS scores in evaluating patients with depression.
Montgomery–Åsberg Depression Rating Scale
The MADRS was developed in the late 1970s from items that were found in
several studies to be sensitive to change with antidepressant
treatment.3 Since
its publication the scale has become increasingly popular worldwide.
Dissatisfaction with the leading alternative, the Hamilton Rating Scale for
Depression (HRSD), has further contributed to the popularity of the
MADRS.4
The importance of reliability of assessments in a clinical trial cannot be overestimated. Without good interrater agreement the chances of detecting a difference in effect between drug and placebo are significantly reduced. Muller & Szegedi demonstrated that as the reliability of a rating scale decreases from 0.8 to 0.5, the power of the test to detect a significant difference between drug and placebo drops from 71% to 51%, increasing the risk of type II error.5 In general, total scale score reliability of the most commonly used depression rating scales such as the MADRS and HRSD is high, with or without the use of a structured interview guide.3,6 However, as compounds have become targeted to specific symptoms and clinical trials have revealed specific drug effects on clusters of symptoms,7,8 it has become more important to be able to depend on the reliable measurement of an individual symptom or a subgroup of symptoms. Self-report versions of clinician-administered scales have been developed9,10 that show high degrees of correlation with the clinician versions; however, there are limited empirical data on their signal detection relative to the clinician in placebo-controlled trials.
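The link between reliability and power can be illustrated with a minimal sketch. It assumes the classical attenuation model, in which the observed standardised effect equals the true effect multiplied by the square root of the scale's reliability, and it uses a normal approximation for a two-sided, two-arm comparison. The function name, the effect size of 0.5 and the 50 patients per arm are hypothetical choices for illustration; this is not the calculation used by Muller & Szegedi and is not expected to reproduce their 71% and 51% figures.

```python
from statistics import NormalDist


def attenuated_power(true_d, reliability, n_per_arm, alpha=0.05):
    """Approximate power of a two-arm comparison when the outcome is
    measured with the given reliability (normal approximation)."""
    norm = NormalDist()
    # Classical attenuation: measurement error inflates the observed variance,
    # shrinking the standardised effect by sqrt(reliability).
    observed_d = true_d * reliability ** 0.5
    z_crit = norm.inv_cdf(1 - alpha / 2)
    noncentrality = observed_d * (n_per_arm / 2) ** 0.5
    # Probability of exceeding the upper critical value (lower tail ignored).
    return norm.cdf(noncentrality - z_crit)


# Hypothetical trial: true effect size 0.5, 50 patients per arm.
for rel in (0.8, 0.5):
    print(f"reliability={rel}: power={attenuated_power(0.5, rel, 50):.2f}")
```

Under these assumptions power falls as reliability falls, which is the qualitative pattern described above; the exact values depend on the assumed effect size and sample size.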
There is minimal information available concerning the interrater reliability of the MADRS. The original article4 reported excellent agreement between rater pairs, but only as conjoint reliability, and in only 11 patients. Maier et al reported total score intraclass coefficients (ICCs) of 0.73, 0.66 and 0.82 in three separate samples, using joint interviews in the first sample, and independent interviews in the second and third samples (which were actually the same patients pre- and post-treatment).11 Unfortunately item reliability was not provided, although the authors did report that three of the MADRS items (inner tension, lassitude and suicidal thoughts) had ICCs lower than 0.60 in all three samples. Davidson et al tested the reliability of the MADRS in 44 people receiving in-patient treatment for depression, using an experienced research nurse and a psychiatrist as joint interviewers.12 Overall agreement was acceptable and ranged from fair to good on individual items. More recently, a Japanese version was developed and tested in Japan in joint interviews on a sample of seven patients with DSM–IV major depressive disorder.13 Individual-item ICCs were in the very good to excellent range; however, the weakness of the testing method (small sample size and repeated assessment of the same patients by the same three raters in joint interviews) compromised the significance of the results. Therefore, there is reason to believe that the interrater reliability of the MADRS in a typical research study could be improved.
The MADRS was originally published without suggested questions for clinicians to use in gathering the information necessary to rate the ten items. However, several studies have found that using a structured or semi-structured interview guide improves reliability on similar rating scales.14,15 Moberg et al compared independent interviews using the standard unstructured HRSD with the Structured Interview Guide for the HRSD (SIGH–D) and concluded that the SIGH–D produced uniformly higher item- and summary scale reliabilities than the unstructured HRSD.16 Further, in one placebo-controlled antidepressant trial, raters who more closely adhered to a semi-structured interview guide were found to have better signal detection than raters who did not.2 Such an interview guide provides some assurance that raters across clinical trial sites administer the scale in approximately the same way. Structured interview guides also facilitate training in the use of a scale by providing new raters with explicit instructions and specific interview questions that have been derived from expert interviews. Structured interview guides have become fairly standard for diagnostic interviews,16 as well as for many rating scales, including the Hamilton scales for depression (SIGH–D)14 and anxiety (SIGH–A).17,18 In general, they are designed to approximate an expert administration of the scale.
Development of SIGMA probes and conventions
A semi-structured interview guide similar to the SIGH–D was
originally developed by J.B.W.W. for the MADRS in 1988 and has undergone
several revisions since then, based on user experience and feedback from
raters. More recently, K.A.K. joined as co-author in a major overhaul of the
interview guide. The Structured Interview Guide for the MADRS (SIGMA) provides
structured probes to ensure standardisation of administration and
comprehensiveness of coverage of the ten items of the scale. The SIGMA
questions were developed to obtain the information needed to assess each of
the item's anchor points (see Appendix). Each item begins with questions
in bold type that should be asked exactly as written. Often these questions
will elicit enough information about the severity and frequency of a symptom
for the item to be rated with confidence. Follow-up questions are provided,
however, for use when further exploration or additional clarification of
symptoms is necessary. The specified questions should be asked until the rater
has enough information to rate the item confidently. Raters are also
encouraged to add their own probes as necessary to obtain enough information
to rate each item accurately.
In the SIGMA the original MADRS appears on the right-hand side of the page and the structured interview guide questions appear on the left. The interview guide begins with the overview, which is a brief explanation of the time period to be covered, and initial questions to allow some rapport to develop and to give the interviewer some sense of the context of the interviewee's current situation. The interview guide then follows, with questions for each of the ten MADRS items.
In the SIGMA the only change that was made to the original MADRS was to reverse the order of administration of the first two items (apparent sadness and reported sadness). There was consensus from users that asking about reported sadness first made for a more logical flow to the interview. Direct probes were added to the apparent sadness item to supplement the rater's observation (e.g. 'In the past week, do you think you have looked sad or depressed to other people?'). The rationale for these additional probes was that without the aid of an informant who has seen the patient over the past week it is difficult to rate the persistence and depth of this item based on observation during the interview alone. This technique has been used successfully in self-report and telephone-administered versions of the MADRS,10,19 as well as in computerised and paper-and-pencil self-report versions of the HRSD20,21 and the Hamilton Rating Scale for Anxiety.22 Raters are instructed to consider both sources of information (direct observation and self-report) in rating this item.
In the interview guide there is an emphasis on open-ended questions, to encourage respondents to describe their experience in their own words, and to avoid raters putting words in the person's mouth. Thus, for example, instead of asking the person at the beginning of the interview, 'Have you been feeling sad or unhappy?', the enquiry begins, 'How have you been feeling since last [day of week]?' Likewise, instead of asking whether the person has had trouble sleeping in the past week, the sleep item begins, 'How has your sleeping been in the past week?' Some items are assessed more directly, to improve the efficiency of the interview. For many responses the person is asked to provide examples; for instance, if there is a positive response to the question, 'Have you had trouble concentrating or collecting your thoughts in the past week?', the interviewer is instructed to ask, 'Can you give me some examples?' Once the person has described the symptom in his or her own words, the interviewer can then decide whether concentration difficulty is truly present, which would be rated in this item.
Once the revised SIGMA was completely drafted, revisions were made based on feedback from a number of users in the field, and the instrument was finalised. This report describes a formal assessment of the test–retest reliability of the 2006 version of the SIGMA, which is presented in full in the Appendix.
There is growing interest in the use of remote technologies for delivering assessments in clinical trials.23 Therefore, of the 81 pairs of interviews, 30 pairs were done using two face-to-face interviews, 30 pairs were done using one face-to-face and one videoconference interview, and 21 pairs were done using one face-to-face and one telephone interview. To control for the confounding influence of participant differences on mode of administration, the same 30 people were used in the face-to-face v. face-to-face and the video v. face-to-face cohorts. To minimise memory effects these cohorts were interviewed on different days (1–3 days apart) and no rater ever saw the same patient twice.
Fifty-one participants (14 men and 37 women; mean age 43 years, s.d.=12.35, range 20–72) with a mood disorder diagnosed according to DSM–IV criteria were included.24 The diagnoses were major depression, n=27; major depression in partial remission, n=15; minor depression, n=2; dysthymia, n=1; bipolar disorder (depressed), n=2; and depression not otherwise specified, n=4. Diagnoses were determined using a modified version of the mood module of the Mini International Neuropsychiatric Interview (MINI)25 and the overview section of the Structured Clinical Interview for DSM–IV (SCID).16 Since previous versions of the SIGMA are already widely used in clinical trials, a range of mood disorders was included in order to evaluate the reliability of the SIGMA across a wide range of symptom severity, including patients in partial recovery. The sample was 82% White, 10% African American, 2% Hispanic and 6% other. About half (47%) had a college degree. Participants were recruited from the Madison, Wisconsin area in response to advertisements in a local newspaper looking for people who were currently experiencing or had recently experienced symptoms of depression. Respondents were screened over the telephone by a research assistant for gross exclusions (current or lifetime schizophrenia, current psychosis or acute suicidal ideation) and were then scheduled for further follow-up screening with a clinician. All participants signed informed consent statements approved by the Allendale institutional review board, and were paid US$50.
The rater cohort consisted of two male and four female interviewers. Five had doctoral degrees (four in psychology and one in social work), and one a master's degree in counselling psychology. Prior to the study, raters underwent reliability training on the scale, consisting of a didactic review of the scale and scoring conventions, followed by at least three practice interviews observed by a trainer (two group and one individual). Raters also observed each other's training sessions in order to enhance learning. Raters' prior experience with the MADRS ranged from extensive (J.B.W.W., K.A.K.) to minimal. All raters' interviewing skills were evaluated using the Rater Applied Performance Scale (RAPS)26 and all raters scored at least 'good' on all dimensions before the study began. Raters were paired using numerous permutations of dyads, to maximise generalisability. Order was counterbalanced, so that raters conducted an equal number of first and second interviews.
The intraclass correlation for total score between raters conducting MADRS interviews using the SIGMA was r=0.93 (P<0.0001, 95% CI 0.89–0.95). In addition, there was no significant difference between the mean MADRS scores obtained by the first interviewer and the second interviewer: 20.49 (s.d.=10.5) v. 20.65 (s.d.=10.6); mean difference 0.16 points (t(80)=0.36, P=0.72). Internal consistency reliability (coefficient alpha) was also examined. Similar levels of internal consistency were found for interviews administered by the first interviewer (r=0.90) and those done by the second interviewer (r=0.90), z=–0.19, P=0.85; the 95% confidence interval for the difference (d(r)=0.0057) was –0.50 to 0.12. An examination of the interrater reliability (ICC) at the item level is presented in Table 1. As Blacker & Endicott have indicated, 'it is sometimes said that an [ICC] above 0.8 can be considered excellent, 0.7–0.8 good, 0.5–0.7 fair, and less than 0.5 poor' (page 9).27 All of the ten MADRS items using the SIGMA had good to excellent interrater reliability, with more than half of them in the excellent range, as was the total score.
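As an illustration of how an intraclass correlation of this kind can be computed and labelled, the following is a minimal sketch. It assumes a one-way random-effects ICC for single ratings, a common choice when each participant is seen by a different pair of raters; the report does not state which ICC form was used, so this is an assumption. The function names and the five score pairs are hypothetical, and the handling of the boundary values 0.7 and 0.8 in the verbal labels is also an assumption, since the quoted ranges do not specify it.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC for single ratings.

    `ratings` is a list of tuples, one per participant, each holding the
    k scores given to that participant by k independent raters.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    subject_means = [sum(r) / k for r in ratings]
    # Between-participant and within-participant mean squares (one-way ANOVA).
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, subject_means) for x in r
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def describe_icc(icc):
    """Verbal label following the ranges quoted from Blacker & Endicott.
    Assigning exactly 0.7 and 0.5 to the higher band is an assumption."""
    if icc > 0.8:
        return "excellent"
    if icc >= 0.7:
        return "good"
    if icc >= 0.5:
        return "fair"
    return "poor"


# Hypothetical MADRS totals from two independent interviews per participant.
scores = [(12, 14), (30, 28), (8, 9), (22, 25), (17, 15)]
value = icc_oneway(scores)
print(f"ICC={value:.2f} ({describe_icc(value)})")
```

The same function could be applied item by item to obtain item-level coefficients of the kind summarised in Table 1.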
Table 1 Intraclass correlations between raters on individual items of the Montgomery–Åsberg Depression Rating Scale, by mode of administration
The correlation (ICC) between the SIGMAs administered by videoconference and those administered face-to-face was r=0.95 (P<0.0001, 95% CI 0.89–0.97). The correlation between SIGMAs administered by telephone and those given face-to-face was r=0.90 (P<0.0001, 95% CI 0.78–0.96) and that between the 30 pairs of face-to-face interviews was r=0.93 (P<0.0001, 95% CI 0.87–0.97). There was no statistically significant difference among these correlations: z=0.45, P=0.66, 95% CI for the difference (d(r)=0.0141) –0.09 to 0.98 for two face-to-face v. video and face-to-face; z=0.07, P=0.95, 95% CI for the difference (d(r)=0.0048) –0.59 to 0.729 for two face-to-face v. telephone and face-to-face.
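One textbook way to compare two correlation coefficients is Fisher's z-transformation, sketched below. This is an illustration only: the test assumes independent samples and Pearson-type coefficients, whereas the cohorts here shared participants and the coefficients are ICCs, and the report does not state which procedure the authors used. The sketch therefore should not be expected to reproduce the z values given above. The function name is hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist


def compare_correlations(r1, n1, r2, n2):
    """Fisher z-test for the difference between two correlations
    estimated in independent samples of size n1 and n2."""
    z1, z2 = atanh(r1), atanh(r2)           # Fisher transformation
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # standard error of z1 - z2
    z = (z1 - z2) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided P value
    return z, p


# Coefficients and sample sizes taken from the text: video v. face-to-face
# (r=0.95, 30 pairs) against two face-to-face interviews (r=0.93, 30 pairs).
z, p = compare_correlations(0.95, 30, 0.93, 30)
print(f"z={z:.2f}, P={p:.2f}")
```

With samples of this size, differences between coefficients as close as these are well within sampling error, which is consistent with the non-significant comparisons reported.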
We also took this opportunity to examine the diagnostic accuracy of the MADRS in differentiating major depression from the other diagnoses. Using a cut-off score of 18 on the MADRS, the sensitivity of the scale for the diagnosis of major depression was 87%, its specificity was 61%, the positive predictive value was 74% and the negative predictive value 79%.
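The four accuracy measures can be derived from a two-by-two table of scale result against diagnosis, as in the following minimal sketch. The scores and diagnoses are hypothetical and are not the study data; treating a score of 18 or above as a positive result is also an assumption, since the text does not say whether the cut-off is inclusive. The function names are illustrative.

```python
def tabulate(scores, has_major_depression, cutoff=18):
    """Count true/false positives and negatives for a given cut-off.
    A score at or above the cut-off counts as positive (assumption)."""
    tp = fp = fn = tn = 0
    for score, diagnosed in zip(scores, has_major_depression):
        positive = score >= cutoff
        if positive and diagnosed:
            tp += 1
        elif positive and not diagnosed:
            fp += 1
        elif not positive and diagnosed:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def diagnostic_accuracy(tp, fp, fn, tn):
    """Sensitivity, specificity and predictive values from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # diagnosed cases scoring positive
        "specificity": tn / (tn + fp),  # non-cases scoring negative
        "ppv": tp / (tp + fp),          # positives who are diagnosed cases
        "npv": tn / (tn + fn),          # negatives who are non-cases
    }


# Hypothetical MADRS totals and diagnoses (True = major depression).
madrs_totals = [25, 30, 12, 19, 8, 22, 17, 10]
diagnoses = [True, True, True, False, False, True, False, False]
print(diagnostic_accuracy(*tabulate(madrs_totals, diagnoses)))
```

Predictive values depend on the proportion of participants with major depression in the sample, so they would differ in a population with a different case mix.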
Interview length
The mean length of interviews conducted with the SIGMA was 25.8 min
(s.d.=10.04, range 5–56). The mean interview length for the interview
conducted second was 2.4 min shorter than the mean length of the interview
conducted first (26.9 min v. 24.5 min;
t(78)=2.38, P=0.020).
To our knowledge, this is the first assessment of the reliability of the MADRS in which all test–retest interviews were independent, and in which agreement at the item level was reported. In this study agreement on the total MADRS score was in the excellent range, and the reliability of all ten of the items was good to excellent. Our results also support the equivalence of remote administration of the MADRS using the SIGMA, by both telephone and videoconference, to face-to-face interviews. This finding has important implications for the way in which clinical trial assessments are conducted: remote assessments can offer more efficient and flexible administration paradigms than face-to-face assessments.
This study has demonstrated that with the use of the SIGMA, a group of interchangeable raters can achieve high reliability of MADRS total and item scores in a range of patients with depression. The extent to which this improvement in interrater agreement translates into improved signal detection in trials using the MADRS remains to be demonstrated.
Appendix Structured Interview Guide for the Montgomery–Åsberg Depression
Rating Scale (SIGMA)