Background The Patient Health Questionnaire (PHQ) is the most commonly used measure to screen for depression in primary care, but there is still a lack of clarity about its accuracy and optimal scoring method.
Aims To determine via meta-analysis the diagnostic accuracy of the PHQ-9-linear, PHQ-9-algorithm and PHQ-2 questions to detect major depressive disorder (MDD) among adults.
Method We systematically searched major electronic databases from inception until June 2015. Articles were included that reported the accuracy of PHQ-9 or PHQ-2 questions for diagnosing MDD in primary care defined according to standard classification systems. We carried out a meta-analysis, meta-regression, moderator and sensitivity analysis.
Results Overall, 26 publications reporting on 40 individual studies were included, representing 26 902 people (median study size 502, s.d.=693.7), including 14 760 unique adults of whom 14.3% had MDD. The methodological quality of the included articles was acceptable. The meta-analytic area under the receiver operating characteristic curve of the PHQ-9-linear and the PHQ-2 was significantly higher than that of the PHQ-9-algorithm, a difference that was maintained in head-to-head meta-analysis of studies. Our best estimates of sensitivity and specificity were 81.3% (95% CI 71.6–89.3) and 85.3% (95% CI 81.0–89.1) for the PHQ-9-linear, 56.8% (95% CI 41.2–71.8) and 93.3% (95% CI 87.5–97.3) for the PHQ-9-algorithm, and 89.3% (95% CI 81.5–95.1) and 75.9% (95% CI 70.1–81.3) for the PHQ-2. For case finding (ruling in a diagnosis), none of the methods was suitable, but for screening (ruling out non-cases) all methods were encouraging, with good clinical utility, although the cut-off threshold must be carefully chosen.
Conclusions The PHQ can be used as a first-step assessment in primary care, and the PHQ-2 is adequate for this purpose with good acceptability. However, neither the PHQ-2 nor the PHQ-9 can be used to confirm a clinical diagnosis (case finding).
Declaration of interest None.
Copyright and usage © The Royal College of Psychiatrists 2016. This is an open access article distributed under the terms of the Creative Commons Non-Commercial, No Derivatives (CC BY-NC-ND) licence.
Major depressive disorder (MDD) is a serious, disabling condition that is often comorbid with other medical presentations.1–4 Most care for depression is delivered by general practitioners (GPs) and individually many GPs have considerable experience in managing depression.5 Approximately 7% of all consultations in primary care are for depression.6 Yet, clinicians find it challenging to precisely diagnose depression and often overestimate or underestimate levels of distress of their patients sometimes resulting in false-positive or false-negative diagnoses.7 Indeed, GPs are typically able to detect about half of true cases of depression on a one-off visit1 and once diagnosed not all patients with depression receive adequate timely care.8 Although under-detection can lead to inadequate treatment,9 over-detection (misidentification) can lead to inappropriate treatment.9,10 For example, in the Baltimore Epidemiologic Catchment Area Study, 38% of antidepressant users never met the criteria for MDD, obsessive–compulsive disorder, panic disorder, social phobia or generalised anxiety disorder in their lifetime.10 Mitchell et al1 suggested that this could become a particular problem in routine care where prevalence rates are modest when false positives can outnumber false negatives.
Given that many clinicians have highlighted the difficulties in the timely diagnosis of depression11 and that depression care is often inadequate,12–14 the use of screening tools in routine care has been suggested by some as possibly beneficial by enhancing diagnosis-as-usual. Screening is most usefully defined as the systematic application of a test to rule out those without a condition, and case finding as the systematic application of a test to confirm those with a condition.15 Screening and case finding have been proposed as solutions and were adopted into the UK primary care Quality and Outcomes Framework (QoF).16 The use of short screening questionnaires (<5 min) and ultra-short questionnaires (<2 min) may improve the recognition of depression if such tests are accurate, acceptable and implemented.17,18 Of all the possible tools for depression, the depression module of the Patient Health Questionnaire (PHQ-9) is the most popular current tool; it has three main formats.
The PHQ-9 (PHQ-9-linear) scored by simple addition and at a threshold of 10 or higher had a sensitivity of 88% and a specificity of 88% for detecting MDD in the initial validation study.19
The PHQ-9 (PHQ-9-algorithm) scored by the algorithm suggested in DSM-IV for MDD. This method requires at least five symptoms rated at least 2 ('more than half the days'; >0 for the suicidal ideation item), of which at least one is loss of interest or pleasure or depressed mood, all present for 2 weeks or more and associated with distress or dysfunction. As this follows the rules of DSM-IV more precisely, it is anticipated that this method should be the most accurate.
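The algorithm scoring rule can be sketched in code. This is an illustrative implementation only (the function name and input representation are assumptions, not part of the PHQ), assuming the nine items are supplied as integer responses from 0 to 3:

```python
def phq9_algorithm_positive(items):
    """Illustrative DSM-IV-style algorithm scoring for nine PHQ-9 responses.

    items: nine integers, 0 ("not at all") to 3 ("nearly every day").
    Items 1 and 2 are loss of interest/pleasure and depressed mood;
    item 9 (suicidal ideation) counts as a symptom at any score above 0.
    """
    # A symptom is endorsed if rated >= 2 ("more than half the days"),
    # except the suicidal ideation item, which counts if > 0.
    endorsed = [score >= 2 for score in items[:8]] + [items[8] > 0]
    cardinal = endorsed[0] or endorsed[1]   # loss of interest or low mood
    return cardinal and sum(endorsed) >= 5  # five or more symptoms overall
```

Note that the duration and distress/dysfunction criteria are established clinically rather than from the item scores, so they are not represented here.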
The PHQ-2 is the 2-item version utilising only the first two questions, namely loss of interest and low mood for the past 2 weeks, scored by simple linear scoring using a threshold of 2 or higher.20
An adaptation of the PHQ-2 also exists in which the main modification is the timeframe of questioning, which is the past month rather than the past 2 weeks. These are known as the Whooley questions, after the original author.
Yet, it is important to acknowledge that the value of screening and severity assessment has been disputed both in the literature and in clinical practice. Some authors have stated that routine use of depression tools should identify patients with either previously unrecognised MDD or untreated MDD (in effect a demonstration of added value)21 but policy recommendations and guidelines have been inconsistent. In 2009, the United States Preventive Services Task Force (USPSTF) recommended routine depression screening in primary care settings with follow-up.22 This recommendation has recently been revised and extended.23 In the UK, the national guidelines have reversed their advice24 and the most recent draft guidance states that there is little convincing evidence that depression screening will reduce the number of patients with depression or improve depression symptoms.25 GPs in the UK have been less enthusiastic than patients about routine use of depression scales,26 leading to the removal of depression screening incentives from the UK QoF. In 2013, the Canadian CTFPHC reconsidered its earlier guideline and also recommended against screening adults for depression in primary care settings.27 Thus, although some still advocate screening for depression, others do not and the argument has become polarised.23 Few are putting forth the argument that screening might work in some circumstances or that further evidence is required from high-quality studies, leading observers to suggest that this is a form of confirmation bias, with either side defending an entrenched position.28
Four previous meta-analyses have been conducted on the accuracy of the PHQ-9 but none have specifically been conducted in primary care.29–32 One previous meta-analysis has been conducted on the PHQ-2/Whooley questions but is considerably out of date.17 Thus, the primary objective is to conduct a meta-analysis to determine the diagnostic accuracy of the PHQ-9-linear, PHQ-9-algorithm and PHQ-2 questions to detect MDD among adults.
This systematic review was conducted following a predetermined but unpublished protocol.
Inclusion and exclusion criteria
We included studies that reported the accuracy of PHQ-9/PHQ-2 questions for diagnosing MDD in primary care. The setting had to be mostly primary care (but not exclusively, containing >50% of primary care patients) and we identified one study in two publications with mixed recruitment.33,34 Studies focusing on one single medical condition in primary care were excluded.35 The studies had to provide sufficient data to allow us to calculate contingency tables or had to be supplied by authors. We only included studies that defined MDD according to standard classification systems such as the ICD or the DSM using a standardised diagnostic interview schedule (Mini International Neuropsychiatric Interview (MINI), Structured Clinical Interview for DSM Disorders (SCID), Composite International Diagnostic Interview (CIDI), Diagnostic Interview Schedule (DIS) or Revised Clinical Interview Schedule (CIS-R)).
Information sources and searches
Two independent reviewers searched Embase, Web of Science, PsycINFO, CINAHL Plus and PubMed from 1998 until June 2015. We used the key words ‘PHQ’, ‘patient health questionnaire’, ‘screening’, ‘depression’, ‘MDD’, ‘primary care’ and ‘general practice’.
We collected information about study characteristics and quality using a standardised data collection form. We included the following characteristics: setting, country, age of sample, gender of sample, year of study, sample size, masking of the assessor of the reference test, data integrity, cut-off score and translation of non-English versions of the PHQ-9. When an article appeared to meet the criteria but did not contain sufficient data, we contacted the authors up to two times.
After the removal of duplicates, two independent reviewers screened the titles and abstracts of all potentially eligible articles. Both authors applied the eligibility criteria, and a list of full text articles was developed through consensus. The two reviewers then considered the full texts of these articles and the final list of included articles was reached through consensus. A third reviewer was available for mediation throughout this process.
Methodological quality assessment
We used the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool to assess risk of bias factors in primary studies, and these factors will be included as study-level variables in analyses.36 The updated QUADAS-2 guidelines stipulate that it should be adapted for each specific review. We employed the QUADAS-2 adaptation utilised in a recent generic PHQ meta-analysis.30 The QUADAS-2 incorporates assessments of risk of bias across four core domains: patient selection, the index test, the reference standard and the flow and timing of assessments. Two reviewers independently assessed risk of bias with any discrepancies resolved by consensus. Two reviewers also independently assessed outliers that may be qualitatively different in study design.
Meta-analysis and proposed subgroup analysis
A pooled meta-analysis of suitable studies was conducted to identify overall test accuracy, sensitivity, specificity, combined Youden score, positive and negative predictive values (PPV/NPV), positive and negative likelihood ratios (LR+/LR−) and positive and negative clinical utility index (CUI+/CUI−). Further details are available at www.clinicalutility.co.uk. The CUI is a proxy for the applied value of a test, with a qualitative as well as quantitative interpretation.37–39 Clinical utility may be more important to clinicians than validity.40 Clinical utility estimates the clinical value of a diagnostic test, taking into account both the accuracy of the test and the occurrence (prevalence) of the condition. The positive utility index (for rule-in or case-finding accuracy) is the product of sensitivity and PPV, and the negative utility index (for rule-out or screening accuracy) is the product of specificity and NPV. The interpretation of the CUI is: 0.93–1.00 near perfect value, 0.81–0.92 excellent, 0.64–0.80 good, 0.49–0.63 fair, 0.36–0.48 poor and <0.36 very poor.
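As a concrete illustration of these definitions, both utility indices and their qualitative grades can be computed directly from a 2 × 2 contingency table; the counts below are invented for illustration only:

```python
def cui(tp, fp, fn, tn):
    """Positive and negative clinical utility indices from a 2x2 table.

    CUI+ = sensitivity * PPV (rule-in / case-finding value)
    CUI- = specificity * NPV (rule-out / screening value)
    """
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    ppv, npv = tp / (tp + fp), tn / (tn + fn)
    return sens * ppv, spec * npv

def grade(value):
    """Qualitative bands for the CUI as given in the text."""
    bands = [(0.93, "near perfect"), (0.81, "excellent"), (0.64, "good"),
             (0.49, "fair"), (0.36, "poor")]
    for cutoff, label in bands:
        if value >= cutoff:
            return label
    return "very poor"

# Invented example: 100 cases and 1000 non-cases (prevalence ~9%)
cui_pos, cui_neg = cui(tp=80, fp=150, fn=20, tn=850)
print(grade(cui_pos), grade(cui_neg))  # rule-in weak, rule-out strong
```

Even with respectable sensitivity and specificity, CUI+ is dragged down by the low PPV at primary care prevalence, which is the pattern the Results section reports.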
Sensitivity and specificity are generally regarded as intrinsic characteristics of a test and independent of prevalence and are a useful initial metric, but these measures do not reflect clinical practice or inform clinicians how to interpret a positive or negative test.41 Summary measures of diagnostic accuracy typically use receiver operating characteristic (ROC) curve analysis, by which sensitivity and specificity linked with all possible cut-off scores were calculated and plotted.42 For an individual study, an optimal cut-off score is chosen which balances sensitivity and specificity. ROC curve data are a proportion with a confidence interval which can be combined across all qualifying studies. From the supplied data, we constructed 2 × 2 tables for each cut-off score and computed any missing values. For completeness, we also performed a bivariate meta-analysis to obtain pooled estimates of specificity and sensitivity and their associated 95% confidence intervals (CIs). We constructed summary ROC curves using the bivariate model to produce a 95% confidence ellipse within the ROC curve space. Each data score in this space represents a separate study. We also constructed a Bayesian plot of conditional probabilities which shows all PPVs and NPVs across every possible prevalence.
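The Bayesian plot of conditional probabilities follows directly from Bayes' theorem; a minimal sketch, using this review's best estimates for the PHQ-2 as example inputs:

```python
def predictive_values(sens, spec, prevalence):
    """PPV and NPV at a given prevalence, from Bayes' theorem."""
    p = prevalence
    ppv = sens * p / (sens * p + (1 - spec) * (1 - p))
    npv = spec * (1 - p) / (spec * (1 - p) + (1 - sens) * p)
    return ppv, npv

# PHQ-2 best estimates from this review: sensitivity 89.3%, specificity 75.9%
for p in (0.05, 0.143, 0.30):
    ppv, npv = predictive_values(0.893, 0.759, p)
    print(f"prevalence {p:.1%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

At the 14.3% prevalence observed here, NPV remains high while PPV stays modest, which is roughly the pattern the Bayesian curves display.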
We assessed between-study heterogeneity using the I2 statistic43 which describes the percentage of total variation across studies that is caused by heterogeneity rather than chance. As per convention, we considered an I2 value of 25% to be low, 50% to be moderate and 75% to be high. We explored the causes of heterogeneity if there was significant between-study heterogeneity. Publication bias was assessed by Harbord or Egger methods.44
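The I2 statistic is derived from Cochran's Q; a minimal sketch using inverse-variance (fixed-effect) weights:

```python
def i_squared(effects, variances):
    """I^2 heterogeneity statistic (%) from Cochran's Q.

    effects: per-study effect estimates; variances: their sampling variances.
    I^2 = max(0, (Q - df) / Q) * 100, where df = number of studies - 1.
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    return max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
```

For example, two equally weighted studies with effects 0.0 and 2.0 and unit variances give Q = 2 on 1 degree of freedom, i.e. I2 = 50%, the conventional 'moderate' band.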
For a secondary moderator analysis, we performed sub-analysis in clinically relevant subgroups such as those studies with a head-to-head comparison of tools in the same sample. We also attempted a logistic meta-regression analysis of diagnostic accuracy using the 50th percentile of Youden score (sum of sensitivity and specificity) using covariates in the meta-regression model.45 We investigated heterogeneity resulting from the characteristics of the sample or study design by exploring the effects of potential predictive variables.
The initial search yielded 777 hits. After removal of duplicates, 621 abstracts and titles were screened (Fig. 1). At the full-text review stage, 58 articles were considered and 32 were subsequently excluded, leaving 26 publications and 40 different analyses that were included in the review.19,32,33,46–68 Details regarding the search results, including reasons for exclusion of articles are summarised in Fig. 1.
Study and participant characteristics
Details of the included studies are summarised in Table 1. Briefly, 11 studies examined the PHQ-2, 9 examined the PHQ-9-algorithm and 20 examined the PHQ-9-linear. Several studies compared diagnostic methods within the same population, allowing a head-to-head comparison. Of particular interest, Thompson & Higgens,45 Manea et al32 and Lowe et al33 compared all three diagnostic methods. Chen et al, Kroenke et al, Liu et al, de Lima Osório et al 2009, Phelan et al and Richardson et al compared the PHQ-2 with the PHQ-9-linear.19,50,52,57,60,61 Lamers et al, Lotrakul et al, Wittkampf et al and Zuithoff et al compared the PHQ-9-algorithm with the PHQ-9-linear.56,58,66,68 In these head-to-head studies, the cut-off thresholds were consistent, namely PHQ-2 (linear) ≥2 and PHQ-9 (linear) ≥10.
The total sample size was 26 902 (median 502, s.d.=693.7) with a mean patient age of 49.38 years, and 61% were female. There were 23 706 individuals without depression according to the criterion reference and 3009 with depression, meaning that the prevalence of depression in primary care was 11.3% (95% CI 10.92–11.68%) from simple pooling of data. However, as several publications used multiple tests, after limiting the analysis to unique adults, there were 14 760 people, of whom 2117 had depression (14.3%; 95% CI 11.3–17.7).
Supplementary Table DS1 summarises the QUADAS-2 scores for all of the included studies. Only four studies were judged to be at low risk of bias across all four domains.33,45,55,59 Three studies had either a high risk of bias or were considered possible outliers: Richardson et al61 studied adolescents seen in primary care; Whooley et al65 used the Whooley questions and was eventually excluded; and Cannon et al48 used lifetime risk of depression rather than current depression (although this did not significantly influence the recorded prevalence levels). We used this information in a moderator analysis.
Diagnostic accuracy of the PHQ
Sensitivity and specificity meta-analysis
Main analysis. The diagnostic validity meta-analysis gave overall sensitivity estimates of 82.2% (95% CI 74.3–88.9), 58.4% (95% CI 44.5–71.7) and 89.9% (95% CI 83.4–94.9) for the PHQ-9-linear, PHQ-9-algorithm and PHQ-2 respectively. In all cases, there was significant heterogeneity but no significant publication bias (see Table 2 which contains the heterogeneity and publication bias data for all of the pooled analysis). The pooled specificity was 84.7% (95% CI 80.4–88.5), 92.1% (95% CI 85.9–96.6) and 72.6% (95% CI 66.0–78.7) for the PHQ-9-linear, PHQ-9-algorithm and PHQ-2 respectively. In the sensitivity analysis (in which we removed the three outliers) and in the bivariate analysis, the results were broadly unchanged (Table 3 and Fig. 2) but they did generate our best estimate of sensitivity of 81.3% (95% CI 71.6–89.3) and specificity of 85.3% (95% CI 81.0–89.1) for the PHQ-9-linear; a best estimate of sensitivity of 89.3% (95% CI 81.5–95.1) and specificity of 75.9% (95% CI 70.1–81.3) for the PHQ-2; a best estimate of sensitivity of 56.8% (95% CI 41.2–71.8) and specificity of 93.3% (95% CI 87.5–97.3) for the PHQ-9-algorithm.
Subanalysis (head to head) PHQ-9-linear v. PHQ-2. In a subanalysis restricted to head-to-head studies on the same population, the sensitivity of the PHQ-9-linear was 87.0% (95% CI 75.8–95.1) v. 91.4% (95% CI 83.6–96.9) for the PHQ-2. The specificity of the PHQ-9-linear was 87.2% (95% CI 81.1–92.2) v. 72.2% (95% CI 64.0–79.8) for the PHQ-2. In the sensitivity analysis, the results were unchanged (Table 3).
Subanalysis (head to head) PHQ-9-linear v. PHQ-9-algorithm. In a subanalysis restricted to head-to-head studies on the same population, the sensitivity of the PHQ-9-linear was 81.1% (95% CI 63.3–93.9) v. 53.1% (95% CI 36.4–69.3) for the PHQ-9-algorithm. The specificity of the PHQ-9-linear was 86.3% (95% CI 80.4–91.4) v. 95.7% (95% CI 93.5–97.5) for the PHQ-9-algorithm, suggesting significantly lower specificity for the PHQ-9-linear. However, caution is necessary as these results are from a predefined cut-point of ≥10. Results were broadly unchanged in the sensitivity analyses (Table 3).
Cut-off analysis: effect of cut-off thresholds. In an analysis restricted to specific cut-offs, we analysed the effect of choosing different fixed cut-off thresholds on the PHQ-2 and PHQ-9 when scored using linear scoring. Results are shown in Table 3. Inevitably, as the cut-point increased, sensitivity decreased and specificity increased. For the PHQ-9, looking at combined sensitivity and specificity (Youden index), the optimal cut-off would be ≥10, followed by ≥11.
Moderator analysis: effect of influencing variables. In a moderator analysis we found no moderating effect of country, mean age, gender, year of publication or sample size on diagnostic accuracy.
ROC curve meta-analysis
Main analysis: PHQ-9-linear, PHQ-9-algorithm and PHQ-2. The pooled ROC diagnostic validity meta-analysis gave an overall area estimate of 0.91 (95% CI 0.892–0.930) for the PHQ-9-linear, 0.733 (95% CI 0.676–0.795) for the PHQ-9-algorithm and 0.860 (95% CI 0.819–0.903) for the PHQ-2. In all cases there was significant heterogeneity but no significant publication bias; see Table 3 for a summary of results. Results were broadly unchanged in moderator analysis, with area under the ROC of 0.910 (95% CI 0.882–0.939) for the PHQ-9-linear, 0.732 (0.667–0.803) for the PHQ-9-algorithm and 0.877 (0.824–0.934) for the PHQ-2.
Subanalysis (head to head) PHQ-9-linear v. PHQ-2. The area under the ROC for the PHQ-2 was 0.898 (95% CI 0.864–0.933) and 0.922 (95% CI 0.882–0.964) for the PHQ-9-linear in the head-to-head studies. Once again, results were unchanged in the sensitivity analysis.
Subanalysis (head to head) PHQ-9-linear v. PHQ-9-algorithm. The area under the ROC for the PHQ-9-linear was 0.920 (95% CI 0.915–0.925) and 0.715 (95% CI 0.628–0.815) for the PHQ-9-algorithm when restricted to four head-to-head studies.
Subanalysis (head to head) PHQ-2 v. PHQ-9-algorithm. There were insufficient data for this comparison.
Test performance: case finding v. screening
Examining PPV, the diagnostic validity meta-analysis suggested superior PPV of the PHQ-9-algorithm 47.4% (95% CI 44.7–50.1) compared with the PHQ-2 23.1% (95% CI 21.5–24.7); however, caution is required because prevalence is not controlled for (i.e. not matched in both analyses) (correction for prevalence is shown in the Bayesian curve of conditional probabilities). Examining NPV, meta-analysis suggested superior NPV of the PHQ-2 97.5% (95% CI 97.0–97.9) compared with the PHQ-9-algorithm 94.8% (95% CI 94.3–95.3); however, caution is again required because prevalence is not controlled for in this analysis. Results using likelihood ratios are shown in Table 3, but more informative is the clinical utility. For case finding (CUI+), all methods were disappointing, with the following results: PHQ-9-linear 0.312 (95% CI 0.311–0.313), PHQ-9-algorithm 0.277 (95% CI 0.276–0.278) and PHQ-2 0.188 (95% CI 0.187–0.189), all suggesting very poor performance at typical prevalence rates seen in primary care. Results were not substantially different using a moderator analysis for high-quality studies with a fixed cut-off or using head-to-head analysis.
For application as a screening test (CUI−), all methods were satisfactory, with the following results: PHQ-9-linear 0.827 (95% CI 0.826–0.828), PHQ-9-algorithm 0.873 (95% CI 0.873–0.873) and PHQ-2 0.708 (95% CI 0.707–0.709), all suggesting good to excellent performance at typical prevalence rates seen in primary care. Results were not substantially different using a moderator analysis for high-quality studies with a fixed cut-off or using head-to-head analysis. All analyses suggested the optimal rule-out screening test would be the PHQ-9-algorithm, closely followed by the PHQ-9-linear.
Using a Bayesian curve of conditional probabilities, the performance of each test (judged by PPV and NPV) can be demonstrated at every possible prevalence applicable to different settings (Figs. 3 and 4). From the Bayesian curve, the most encouraging test would be the PHQ-2 used as an initial screener followed by either the PHQ-9-linear or another suitable case-finding tool.
Cut-off analysis: effect of cut-off thresholds
On the PHQ-9, looking at combined PPV and NPV (predictive summary index), the optimal cut-point would be ≥14. For the PHQ-2, looking at combined sensitivity and specificity (Youden index), the optimal cut-off would be ≥3, closely followed by ≥2 (note that ≥2 is the conventional threshold). However, looking at combined PPV and NPV (predictive summary index), the optimal cut-point would be ≥6, followed by ≥5. Comparing the PHQ-9 and the PHQ-2 across all possible cut-offs shows that neither is satisfactory as a case-finding tool in primary care at any cut-off, but the optimal single method is the PHQ-2 at a threshold of ≥6. Of those without MDD, 98.6% have a score of 5 or lower on the PHQ-2 and, of those with a score of 5 or lower, 93.5% are true negatives (true non-cases) (Tables 2–4).
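The two competing optimisation criteria in this cut-off analysis can be made explicit in code; the per-cut-off accuracy figures below are invented for illustration and are not taken from Table 3:

```python
def best_cutoffs(accuracy_by_cutoff, prevalence):
    """Return the cut-off maximising the Youden index (sens + spec - 1)
    and the cut-off maximising the predictive summary index (PPV + NPV - 1).

    accuracy_by_cutoff maps cut-off -> (sensitivity, specificity).
    """
    p = prevalence
    youden, psi = {}, {}
    for c, (se, sp) in accuracy_by_cutoff.items():
        youden[c] = se + sp - 1
        ppv = se * p / (se * p + (1 - sp) * (1 - p))        # Bayes' theorem
        npv = sp * (1 - p) / (sp * (1 - p) + (1 - se) * p)
        psi[c] = ppv + npv - 1
    return max(youden, key=youden.get), max(psi, key=psi.get)

# Hypothetical PHQ-2 accuracy at three cut-offs, at 14.3% prevalence
cutoffs = {2: (0.95, 0.60), 3: (0.89, 0.76), 6: (0.55, 0.97)}
print(best_cutoffs(cutoffs, 0.143))  # Youden favours a low cut-off, PSI a higher one
```

This illustrates the qualitative point made above: at low prevalence, the predictive summary index rewards the higher specificity of a stricter cut-off, pushing the optimum above the conventional threshold.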
A previous meta-analysis of 41 studies involving 50 371 individuals in primary care found a pooled prevalence of 18.4% (95% CI 13.5–23.9) in adults aged 18–65 years using semi-structured interviews.1 In this study, we found a slightly lower prevalence of depression in primary care of 14.3% (95% CI 11.3–17.7%) across 14 760 adults. The PHQ-9-linear had better sensitivity but worse specificity than the PHQ-9-algorithm. However, this finding could result from choosing a PHQ-9-linear cut-off threshold which is too low. The PHQ-2 had greater sensitivity but significantly lower specificity than the PHQ-9-linear method. The ROC meta-analysis suggested that the area under the ROC of the PHQ-9-linear, as well as that of the PHQ-2, was significantly higher than that of the PHQ-9-algorithm, which was surprising given that the PHQ-9-algorithm more tightly adheres to the DSM criterion standard. The difference was maintained when the comparison of the PHQ-9-linear and PHQ-9-algorithm was restricted to four head-to-head studies. In head-to-head studies, the tools are tested against one another in the same sample, ruling out differences according to prevalence or local conditions. Using the same methods, there were no clear differences between the PHQ-2 and PHQ-9-linear, which again is surprising given the brevity of the PHQ-2.
However, these results do not clarify a specific role for any method in either screening or case finding. For case finding, consistent with previous literature, all methods were disappointing, with results on the CUI+ graded as 'very poor'. Looking at PPV alone for all methods using the Bayesian curve, results were similarly poor, confirming overall poor performance at typical prevalence rates seen in primary care. In short, a positive test is infrequent in a typical primary care sample and/or a positive test (when it does occur) is not especially discriminating. For application as a screening test, all methods were encouraging, with the following results on the CUI−: PHQ-9-linear 0.827 (0.826–0.828), PHQ-9-algorithm 0.873 (0.873–0.873) and PHQ-2 0.708 (0.707–0.709), all suggesting good to excellent performance at typical prevalence rates. NPV values were all high. Examining this effect in more detail using a Bayesian curve of conditional probabilities (Figs 3 and 4) demonstrated that, although none of the methods performed particularly well at case finding at any prevalence rate when used alone, they performed reasonably well as an initial first step. The most practical use of these tools would be the PHQ-2 used as an initial screener, followed by either the PHQ-9-linear or another suitable case-finding tool.
We also analysed the effect of varying the cut-point. If simply considering sensitivity and specificity, the cut-point analysis suggested that the current thresholds of ≥10 on the PHQ-9 and ≥2 on the PHQ-2 are very close to optimal. However, as discussed above, there is more to the application of tests in clinical practice than simply looking at combined sensitivity and specificity. Clinical utility is better represented by PPV and NPV. Using PPV and NPV (combined) suggests that a substantially higher cut-point on both the PHQ-9 and the PHQ-2 may be appropriate. Furthermore, if one discounts their role in case finding and simply concentrates on rule-out ability (CUI−), then the optimal cut-points would be ≥14 on the PHQ-9 and ≥6 on the PHQ-2. Although these high thresholds are surprising, it is evident that, of those without MDD, 98.6% have a score of 5 or lower on the PHQ-2 and, of those with a score of 5 or lower, 93.5% are true negatives (true non-cases). Similarly, 96.5% of non-cases scored below 14 on the PHQ-9 and, of those scoring below 14, 92.9% are true negatives. We suggest further work is required to examine the optimal cut-off thresholds if a two-step procedure were to be used.
We acknowledge that there were relatively few studies with all the required subgroups and not all studies reported ROC data (but we were able to calculate this in many cases). To date, studies have not attempted to clarify whether the sample comprises previously untreated or previously undiagnosed patients. We did not attempt to look at severity assessment or sensitivity to change. It must also be acknowledged that the results presented represent the outcome of a single application of the PHQ. Multiple (serial) applications may be conducted in clinical practice and would change results. For completeness, if the PHQ-2 is initially applied (step 1), followed by PHQ-9-linear to those who score positive in step 1, then the combined sensitivity would be 72.4% and specificity 96.4% (overall accuracy 93.0%). If the PHQ-2 were to be initially applied followed by PHQ-9-algorithm, the combined sensitivity would be 50.7% and specificity 98.4% (overall accuracy 91.6%).
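The combined two-step figures quoted above follow from standard serial-testing algebra, under the simplifying assumption that the two tests are conditionally independent given diagnostic status; a sketch:

```python
def serial_screen(sens1, spec1, sens2, spec2):
    """Combined accuracy when test 2 is applied only to test-1 positives.

    A person is screen-positive only if positive on both tests, so the
    sensitivities multiply, while a negative on either test rules out.
    Assumes conditional independence of the tests given disease status.
    """
    sens = sens1 * sens2
    spec = spec1 + (1 - spec1) * spec2
    return sens, spec

# Best estimates from this review: PHQ-2 first, then PHQ-9-linear
se, sp = serial_screen(0.893, 0.759, 0.813, 0.853)
print(f"combined sensitivity {se:.1%}, specificity {sp:.1%}")
```

These independence-based figures (about 72.6% and 96.5%) approximately reproduce the 72.4% and 96.4% reported above.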
Clinical implications and further research
The PHQ has potential to be used to rule out those without depression with few false negatives but an adjustment of the cut-off points (≥14 on the PHQ-9 and ≥6 on the PHQ-2) should be considered. Alternatively its routine use can be improved by a two-step procedure using PHQ-2 and then PHQ-9. This would also reduce the burden on clinicians as the PHQ-9 would only be applied following a positive initial PHQ-2 screen. Depression tools applied for the purpose of screening and/or case finding will only be of use if combined with adequate follow-up and adequate treatment. Screening without removal of barriers to high-quality care is potentially frustrating and arguably counterproductive. Several reviews found modest evidence to support QoF-based PHQ scoring in part because primary care clinicians may lack the skills or resources to appropriately follow-up a positive screen.26,69 Further work on cut-off thresholds and repeat assessment may further improve results but care must be taken not to increase the burden on clinicians if they are required to implement screening tools.
This meta-analysis confirms that neither the PHQ-9 nor the PHQ-2 can confirm a diagnosis of MDD when used alone as a one-off measure, and this is independent of the scoring method. However, both the PHQ-9 and the PHQ-2 can be used as an initial screening step, and they perform quite well in this regard.
We thank Jemma Adams for her help during the revision of this manuscript.
- Received July 3, 2015.
- Revision received December 17, 2015.
- Accepted December 21, 2015.