Invasive breast carcinomas with amplification of human epidermal growth factor receptor 2 (ERBB2) gene or overexpression of the corresponding HER2 protein have distinct tumour biology, and HER2 status is a strong predictive biomarker for tumour response to HER2 targeted therapy.1 HER2 status is usually determined by immunohistochemistry (IHC) and fluorescent in situ hybridisation (FISH).2 3 Based on the current American Society of Clinical Oncology/College of American Pathologists (ASCO/CAP) recommendations, HER2 IHC scores are classified as 0, 1+, 2+ and 3+ (table 1).2 4 Breast cancers with 0 or 1+ scores are currently defined as HER2-negative (HER2–) and do not require further evaluation by FISH. Breast cancers with 2+ (equivocal) scores should be subjected to reflex FISH analysis for determining gene amplification status. Breast cancers with either 3+ scores or ERBB2 gene amplification by FISH are currently defined as HER2-positive (HER2+).
Table 1 Current ASCO/CAP recommendations for HER2 IHC evaluation
Traditionally, only patients with HER2+ breast cancers have been eligible for HER2 targeted therapy.5–7 Trastuzumab deruxtecan (T-DXd) is an antibody drug conjugate composed of trastuzumab, which recognises HER2 protein on the tumour cell surface, linked to a cytotoxic topoisomerase I inhibitor (payload) through an enzyme-cleavable linker.8 9 Following binding of T-DXd to tumour cell surface HER2, T-DXd undergoes internalisation and the linker is cleaved by lysosomal enzymes to release the topoisomerase I inhibitor payload, which inhibits tumour cell growth. The drug payload-to-antibody ratio is approximately 8, and the released payload has cytotoxic effects on adjacent tumour cells regardless of their HER2 expression level, known as the bystander effect.10 In April 2022, T-DXd was granted breakthrough therapy designation by the Food and Drug Administration (FDA) to treat patients with metastatic HER2-low breast cancer (HLBC)11 based on the DESTINY-Breast04 trial, which demonstrated improved clinical outcomes with T-DXd in patients with metastatic HLBC.12 13 In this trial, HLBC was defined as a HER2 IHC score of 1+, or 2+ without gene amplification by FISH.
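The definitions above amount to a small decision rule. The following is a minimal Python sketch of that rule, assuming the IHC score is passed as one of the strings "0", "1+", "2+" or "3+" and the FISH result as an optional boolean. The function name `her2_category` and the returned labels are illustrative choices, not part of the ASCO/CAP recommendations.

```python
from typing import Optional


def her2_category(ihc_score: str, fish_amplified: Optional[bool] = None) -> str:
    """Map an IHC score and an optional FISH result to a HER2 category.

    Mirrors the definitions quoted in the text: 3+ or FISH-amplified is
    HER2-positive; 1+, or 2+ without amplification, is HER2-low.
    """
    if ihc_score not in {"0", "1+", "2+", "3+"}:
        raise ValueError(f"unknown IHC score: {ihc_score!r}")
    if ihc_score == "3+" or fish_amplified:
        return "HER2-positive"
    if ihc_score == "2+":
        # A 2+ score is equivocal until reflex FISH has been performed.
        if fish_amplified is None:
            return "equivocal (reflex FISH required)"
        return "HER2-low"
    if ihc_score == "1+":
        return "HER2-low"
    return "HER2 0"
```

In this sketch a 2+ score without a FISH result is deliberately left unresolved rather than guessed.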
With this FDA approval, patients with HLBC can be treated with T-DXd as the targeted therapy. However, the difference between 0 and 1+ or 1+ and 2+ scores can be difficult to discern. There is no established protocol to guide evaluation of low HER2 levels, leading to high interobserver variability. A recent study showed poor interobserver agreement among 18 pathologists with only 26% concordance for 0 and 1+ scores.14 Furthermore, a multicentre international study demonstrated moderate agreement for distinguishing 0 and HER2-low scores, which did not improve by FISH.15 Inaccurate evaluation of HER2 IHC may lead to suboptimal treatment; for example, patients with metastatic HLBC would be ineligible for T-DXd if HER2 IHC were erroneously assessed. Therefore, there is an urgent unmet need to develop a standardised protocol for identifying HLBC.
The overarching goal of this study was to develop a practical algorithm for IHC evaluation of HLBC and examine its impact on interobserver agreement.
Materials and methods
Study cohort
A total of 106 HER2-negative breast carcinomas (67 biopsies and 39 excisions that met ASCO/CAP guidelines for cold ischaemia time and fixation time) were identified in the pathology archives in which HER2 status was assessed using the ASCO/CAP guidelines as part of routine clinical biomarker testing,16 including 60 cases reported as HER2-negative with IHC 1+ scores (weak incomplete staining in >10% of tumour cells) and 46 cases reported as HER2-negative with IHC 0 scores, the latter group consisting of tumours without any staining and tumours with weak incomplete staining in <10% of tumour cells. HER2 IHC was performed using a laboratory developed test.17–19 Briefly, slides were stained with the HercepTest HER2 antibody (Dako) using an automated Leica Bond III stainer (Leica Biosystems, Deer Park, Illinois, USA) following antigen retrieval in citrate buffer at pH 6.0 for 20 min. The slides were evaluated by six board certified pathologists with subspecialised practice in breast pathology. Of the six pathologists, two were senior with ≥5 years of practice experience, and four were junior with <5 years of experience.
Development of an algorithm for evaluating HER2 IHC scores
The proposed algorithm included: (1) evaluation of the whole slide at ×100 magnification to identify staining heterogeneity; (2) in each area with similar staining percentage, evaluation of the % of incomplete membrane staining of any intensity at ×400 magnification using the eyeballing method and (3) calculation of the global HER2 score. The algorithm is summarised in table 2. In tumours with homogeneous distribution of HER2 membrane staining, only two high power field estimations were performed, and the average was used to calculate a final HER2 IHC score. Figure 1 shows an example of evaluating HER2 IHC in cases with heterogeneous distribution of HER2 IHC staining.
Figure 1 Schematic (A) and microscopic (B) examples of evaluating HER2 immunohistochemical (IHC) staining in tumours with heterogeneous HER2 protein expression using the eyeballing method. Each blue dot represents a tumour cell with incomplete IHC membrane staining of any intensity and each empty dot represents a negative tumour cell. In this example, circle 1 has 20% positive cells and comprises 50% of the entire tumour area; circle 2 has 10% positive cells and comprises 50% of the entire tumour area. The global % of incomplete HER2 membrane staining is [(50%×20%)+(50%×10%)]×100 (%)=15%. IHC, original magnification ×100 (B).
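The global score is an area-weighted average of the per-region estimates. A minimal Python sketch follows, assuming each region is described by its fraction of the total tumour area and its estimated percentage of stained cells. The function name `global_percent` and the input layout are illustrative, not taken from the study protocol.

```python
def global_percent(regions):
    """Area-weighted average of per-region staining percentages.

    `regions` is a list of (area_fraction, percent_stained) pairs whose
    area fractions must sum to 1.
    """
    total_area = sum(area for area, _ in regions)
    if abs(total_area - 1.0) > 1e-6:
        raise ValueError("area fractions must sum to 1")
    return sum(area * percent for area, percent in regions)


# Two regions of equal area with 20% and 10% stained cells.
example = global_percent([(0.5, 20), (0.5, 10)])  # 15.0
```

For the two-region case in figure 1, each region contributes half of its own percentage (10 and 5 percentage points), so the global estimate is 15%.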
Table 2 Stepwise approach for the proposed standardised HER2 IHC evaluation protocol
Two rounds of HER2 IHC evaluation
In both rounds, HER2 IHC was evaluated as the percentage (%) of neoplastic cells with membrane staining in 5% increments: <1, 1–5, 6–10, 11–15, 16–20, 21–25, 26–30, >30 (table 3). No additional specific instructions were given. A research assistant recorded the scores and the time spent per case.
Table 3 Example HER2 IHC scoring sheet
Round 2 was performed using the same slides after 1 month of wash-out time. Training with the proposed protocol described above was provided to all pathologists before round 2. The training included a PowerPoint presentation and demonstration of HER2 IHC evaluation of 10 cases previously reported as HER2-negative (0/1+). These 10 training cases were not included in the study cohort. Within 1 week of the training session, the same six pathologists re-evaluated the same cases. The same research assistant recorded the scores and time spent per case.
Statistical analysis
We first evaluated the interobserver agreement in rounds 1 and 2 (before and after the training with the proposed algorithm). Three score systems were analysed: (A) incremental scores with 5% increase as <1%, 1%–5%, 6%–10%, 11%–15%, 16%–20%, 21%–25%, 26%–30%, >30%; (B) scores with three categories as <1% vs 1%–10% vs >10% and (C) scores with binary categories (<1% vs ≥1% or 0%–10% vs >10%) (figure 2).
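Systems (B) and (C) can be derived from the recorded 5% increment categories of system (A). The sketch below shows one way to do this in Python. The label strings (written with ASCII hyphens), the function name `collapse_score` and the returned dictionary keys are assumptions made for illustration, not the coding used in the study.

```python
# Ordered labels of score system (A), as listed on the scoring sheet.
INCREMENT_LABELS = ["<1", "1-5", "6-10", "11-15", "16-20", "21-25", "26-30", ">30"]


def collapse_score(label: str) -> dict:
    """Derive score systems (B) and (C) from a system (A) label."""
    rank = INCREMENT_LABELS.index(label)  # raises ValueError for unknown labels
    if rank == 0:
        three = "<1%"
    elif rank <= 2:
        three = "1%-10%"
    else:
        three = ">10%"
    return {
        "rank": rank,                # ordinal code for system (A)
        "three_category": three,     # system (B)
        "at_least_1pct": rank >= 1,  # system (C), 1% cut-off
        "over_10pct": rank >= 3,     # system (C), 10% cut-off
    }
```

The ordinal `rank` is the numeric coding that a rank-based statistic such as Kendall's coefficient would operate on.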
Figure 2 Examples of HER2 immunohistochemical (IHC) staining with no expression (A), <5% of tumour cells with weak incomplete membranous staining (B), 15% of tumour cells with weak incomplete membranous staining (C), and >30% of tumour cells with weak to moderate incomplete membranous staining (D). IHC, original magnification ×400 (A–D). IHC, immunohistochemistry.
Concordance was assessed using Kendall’s coefficient (W) for 5% increments and three categories (<1% vs 1%–10% vs >10%), and Fleiss kappa coefficient (K) for binary categories (<1% vs ≥1% or 0%–10% vs >10%). Kendall’s coefficient of concordance is a nonparametric statistic related to Friedman’s test and tests agreement of the raters’ rankings of the items. It is applicable when the response is ordinal and numerically coded. P values of <0.05 were considered significant. For Kendall’s coefficient, agreement was assessed as fair (0.20 to <0.40), moderate (0.40 to <0.60), substantial (0.60 to <0.80) and almost perfect (≥0.80). For the kappa coefficient, agreement was assessed as fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80) or almost perfect (0.81–1.00). Statistical analysis was conducted using SAS V.9.4.
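For readers who want to reproduce these statistics outside SAS, the following pure-Python sketch computes Kendall's W and Fleiss kappa from a raters-by-cases table. It is not the SAS procedure used in the study. The tie correction in `kendalls_w` is an assumption about how heavily tied ordinal scores would be handled, and all function names are illustrative.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kendalls_w(ratings):
    """Tie-corrected Kendall's W; ratings[r][i] is rater r's score for case i."""
    m, n = len(ratings), len(ratings[0])
    ranks = [average_ranks(r) for r in ratings]
    totals = [sum(ranks[r][i] for r in range(m)) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    ties = sum(c ** 3 - c for r in ratings for c in Counter(r).values())
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)


def fleiss_kappa(ratings):
    """Fleiss kappa; ratings[r][i] is rater r's category for case i."""
    m, n = len(ratings), len(ratings[0])
    categories = sorted({x for r in ratings for x in r})
    counts = [[sum(1 for r in ratings if r[i] == c) for c in categories]
              for i in range(n)]
    p_cat = [sum(row[k] for row in counts) / (n * m)
             for k in range(len(categories))]
    p_case = [(sum(x * x for x in row) - m) / (m * (m - 1)) for row in counts]
    p_obs = sum(p_case) / n
    p_exp = sum(p * p for p in p_cat)
    return (p_obs - p_exp) / (1 - p_exp)


def interpret_w(w):
    """Agreement band for Kendall's W, using the cut-offs stated in the text."""
    if w >= 0.80:
        return "almost perfect"
    if w >= 0.60:
        return "substantial"
    if w >= 0.40:
        return "moderate"
    if w >= 0.20:
        return "fair"
    return "below fair"  # label for values under 0.20 is an assumption
```

Both functions expect one row per pathologist and one column per case. They are undefined when every rater assigns every case the same score, because the denominator is then zero.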
Results
Interobserver variability without training with the proposed algorithm
In round 1, the six pathologists assessed HER2 expression in 5% increments. Average scoring time per case was 72 s (range 34–122). Interobserver agreement was substantial for ordinal scale ratings analysis with 5% increments and three categories (<1%, 1%–10% and >10%) (W=0.796 and W=0.768, respectively). Dichotomous scale analysis also demonstrated substantial agreement for the 1% cut-off (0% vs ≥1%) (K=0.650) and moderate for the 10% cut-off (0%–10% vs >10%) (K=0.569) (table 4).
Table 4 Interobserver agreement in round 1 and round 2
Concordance was further analysed based on the pathologists’ seniority. For senior pathologists, interobserver agreement was almost perfect for ordinal analysis including 5% increments and 3 categories of <1%, 1%–10% and >10% (W=0.859 and W=0.837, respectively). Binary analysis also demonstrated substantial agreement for the 1% cut-off (0% vs ≥1%) (K=0.620), and moderate for the 10% cut-off (0%–10% vs >10%) (K=0.478). For junior pathologists, interobserver agreement was almost perfect for ordinal analysis including 5% increments (W=0.821) and substantial for 3 categories of <1%, 1%–10% and >10% (W=0.792). Binary analysis demonstrated substantial agreement with the 1% (0% vs ≥1%) cut-off (K=0.650) and moderate agreement with the 10% (0%–10% vs >10%) cut-off (K=0.576) (table 5).
Table 5 Interobserver agreement in round 1 and round 2 by seniority
Interobserver variability after training with the proposed algorithm
In round 2, HER2 IHC was assessed using the proposed protocol. Average scoring time per case was 92 s (range 33–209) compared with 72 s in round 1. Interobserver agreement was almost perfect for ordinal analysis including 5% increments (W=0.804) and substantial for 3 categories of <1%, 1%–10% and >10% (W=0.764). Binary analysis demonstrated moderate agreement when using the 1% (0% vs ≥1%) (K=0.590) or 10% cut-offs (0%–10% vs >10%) (K=0.549) (table 4).
Concordance was further analysed based on the pathologists’ seniority. For senior pathologists, interobserver agreement was almost perfect for ordinal scale ratings analysis including 5% increments and 3 categories of <1%, 1%–10% and >10% (W=0.872 and W=0.860, respectively). Dichotomous scale analysis demonstrated moderate agreement for the 1% cut-off (0% vs ≥1%) (K=0.554), and substantial for the 10% cut-off (0%–10% vs >10%) (K=0.712). For junior pathologists, interobserver agreement was almost perfect for ordinal analysis with 5% increments (W=0.813) and substantial for 3 categories of <1%, 1%–10% and >10% (W=0.768). Dichotomous scale analysis demonstrated moderate agreement with both the 1% (0% vs ≥1%) and 10% (0%–10% vs >10%) cut-offs (K=0.560 and K=0.465, respectively) (table 5).
Comparison of interobserver variability before and after training with the proposed algorithm
Average scoring time per case was higher in round 2 compared with round 1 (92 s vs 72 s). Interobserver agreement for 5% increments was substantial in round 1 (W=0.796) and increased to almost perfect (W=0.804) in round 2. The agreement slightly decreased but remained substantial for three categories (<1% vs 1%–10% vs >10%) from round 1 (W=0.768) to round 2 (W=0.764). Junior and senior pathologists had almost perfect agreement for 5% increments in round 1 (W=0.821 and W=0.859) and round 2 (W=0.813 and W=0.872). For the three categories, agreement in round 2 slightly increased but remained almost perfect among seniors (W=0.837 vs W=0.860) and slightly decreased but remained substantial among juniors (W=0.792 vs W=0.768). For binary analysis using the 1% cut-off, agreement among juniors and among seniors was substantial in round 1 (K=0.650 and K=0.620, respectively) but decreased to moderate in round 2 (K=0.560 and K=0.554, respectively). When using the 10% cut-off, the agreement of seniors increased from moderate (K=0.492) in round 1 to substantial (K=0.712) in round 2 and remained moderate among juniors (K=0.576 and K=0.465).
Discussion
Although conventionally only patients with HER2+ breast cancer have been eligible for HER2 targeted therapy,20 21 antibody drug conjugates such as T-DXd provide opportunities for patients with HLBC.22–24 At least 50% of breast cancers can be categorised as HER2-low, that is, showing IHC 1+ or 2+ scores with negative FISH.25–27 In the DAISY trial, T-DXd had clinical benefits even in 30% of patients with HER2 IHC 0 scores, a rate almost identical to that of patients with HER2 IHC 1+ scores28; this may be due to inclusion of HER2 ultra-low tumours (defined as incomplete and weak membranous staining in ≤10% of tumour cells) in the IHC 0 group, pathologists’ poor agreement of HER2 IHC evaluation or similar HER2 levels in HER2 ultra-low and HER2 1+ breast cancers.25 26 29 30 Accurate assessment of HER2 expression is therefore crucial for optimal clinical management.
Previous studies have demonstrated low reproducibility of HLBC assessment. The CAP surveys over 2 years from 1391 to 1452 laboratories of 40 cases from each laboratory (20 cases biannually for a total of 80) demonstrated ≤70% concordance for 0 vs 1+ scores in 19% of cases.14 In a cohort of 170 scanned breast biopsies, concordance among 18 pathologists was only 26% for 0/1+ scores compared with 58% for 2+/3+ scores.14 Of note, the study pathologists were unaware of the importance of identifying HLBC. Blinded analysis of 200 scanned HER2 IHC stained slides from 100 independent cases including all 4 HER2 categories (0, 1+, 2+, 3+) by 5 breast pathologists revealed substantial agreement (K=0.79) with a 35% overall discordance rate. The discordant cases consisted of 15 with 1+ vs 0 scores, 12 with 1+ vs 2+ scores, 1 with 2+ vs 0 scores, 1 with 3+ vs 1+ scores and 6 with 3+ vs 2+ scores. The agreement was almost perfect for the 0 and 3+ scores (K=0.82 and K=0.92, respectively), but only substantial for the 1+ and 2+ scores (K=0.67 and K=0.74, respectively).26
There are no guidelines or recommendations on evaluating HLBC. The International Ki67 in Breast Cancer Working Group has developed a Ki67 evaluation protocol with high interobserver agreement which requires evaluation of the percentage (%) of areas with low, medium and high Ki67 index and then counting 100 cells in each area to calculate the final global Ki67 score.31 Despite significant improvement of interobserver agreement, each case takes a median of 9 min.31 No such protocol exists for HER2 IHC. Our simplified version of the Ki67 protocol increased the average scoring time per case from 72 to 92 s. The interobserver agreement for 5% increments increased from substantial to almost perfect (W=0.796 vs W=0.804), and remained substantial for 3 categories of <1% vs 1%–10% vs >10% (W=0.768 vs W=0.764). When analysing the data based on the pathologists’ seniority, the agreement remained almost perfect for both juniors (W=0.821 vs W=0.813) and seniors (W=0.859 vs W=0.872) for 5% increments. It also remained almost perfect among seniors (W=0.837 vs W=0.860) and substantial among juniors (W=0.792 vs W=0.768) for the three categories. For binarised scores based on the 1% cut-off, agreement decreased from substantial to moderate among both juniors (K=0.650 vs K=0.560) and seniors (K=0.620 vs K=0.554). The agreement remained moderate among juniors (K=0.576 vs K=0.465) but increased from moderate to substantial (K=0.492 vs K=0.712) among seniors for binarised scores using the 10% cut-off. Of note, all study pathologists are subspecialised and were aware of the importance of separating 1+ vs 0 scores. Although the proposed protocol may be useful for uniform HER2 IHC assessment, the interobserver agreement in evaluating HLBC is suboptimal among subspecialised breast pathologists even after training with the protocol. The interobserver variability may be higher among general anatomic pathologists.
Several studies have investigated the role of artificial intelligence (AI) in evaluating HLBC. The performance of the AI digital image analysis (DIA)-assisted workflow was recently assessed in a cohort of 67 primary breast carcinomas and 30 metastases in which 3 breast pathologists independently assessed HER2 expression first visually (ground truth) and after being provided the results of the DIA. There was moderate agreement (K=0.59) between the ground truth and AI, with most discrepancies occurring between 0 and 1+ scores.32 Another study included 363 cases with HER2 IHC scores 0, 1+ and 2+ (without HER2 gene amplification) and available HER2 mRNA level. Artificial neural network analysis was then used to distinguish 0 vs 1+ scores. Score 1+ was refined as either faint staining in ≥20% of cells irrespective of the circumferential completeness, weak complete staining in ≤10% of tumour cells, or weak incomplete staining in >10% and moderate incomplete staining in ≤10% of tumour cells. Based on the refined criteria, 63% of cases were reclassified as HER2-low, and the refined scores showed perfect agreement with the original clinical scores.33 However, AI-based analysis may require substantial infrastructure investment and expertise. Evaluation under light microscope by pathologists is still the standard practice and a practical HER2 IHC evaluation algorithm will be of great help.
The interpretation of HLBC has been addressed by the recently published updated ASCO/CAP guidelines, which recommend including the following comment in biomarker reports: ‘patients with breast cancers that are HER2 IHC 3+ or IHC 2+/ISH amplified may be eligible for several therapies that disrupt HER2 signalling pathways. Invasive breast cancers that test “HER2-negative” (IHC 0, 1+ or 2+/ISH not-amplified) are more specifically considered “HER2-negative for protein overexpression/gene amplification” since non-overexpressed levels of the HER2 protein may be present in these cases. Patients with breast cancers that are HER2 IHC 1+ or IHC 2+/ISH not amplified may be eligible for a treatment that targets non-amplified/non-overexpressed levels of HER2 expression for cytotoxic drug delivery (IHC 0 results do not result in eligibility currently)’.4 However, the most recent CAP breast biomarker protocol from March 2023 includes the following comment: ‘Breast cancers with HER2 IHC score 1+ or HER2 IHC score 2+ and a negative ISH result are eligible for clinically appropriate HER2-targeted therapy and may be reported as “HER2 Low”’.34 The ASCO/CAP 2023 guidelines and the CAP biomarker reporting guidelines have yet to be fully aligned.
Of note, the currently used HER2 IHC assays were designed to identify HER2+ tumours. The concept of HLBC is evolving, and whether these assays are ideal for identifying HLBC remains controversial. However, as long as IHC is used to evaluate HER2 expression, a standardised protocol is needed to reduce interobserver variation. Furthermore, updated guidelines for IHC interpretation (analytical phase) as well as accurate and reproducible testing methodologies and strategies (preanalytical phase) are necessary to improve diagnostic sensitivity and avoid under-reporting or over-reporting of HLBC.35 36 Preanalytical factors are particularly complex and include length of fixation, antigen retrieval, antibody clones (eg, 4B5, CB11, HercepTest) and dilution, incubation time and temperature.37 38 These factors may have a tremendous impact on the accuracy and reproducibility of HER2 IHC results and identification of HLBC.37 39 Therefore, each laboratory should have well thought out and rigorous quality control procedures in place for the preanalytical phase combined with well-defined guidelines for HER2 IHC assessment for HLBC.35 40
Our study has limitations. Although IHC has been the standard method of evaluating HER2 expression in invasive breast cancer for years, it may not be the optimal technique for semiquantifying low levels of HER2 protein expression. A broad range of HER2 mRNA expression has been identified in tumours without detectable HER2 protein by IHC.41 Distinguishing between 0 and 1+ scores remained challenging with the proposed algorithm, and a clear cut-off point for T-DXd eligibility remains to be determined. Additionally, the Ventana HER2 4B5 antibody clone has been shown to identify a higher proportion of HER2-low tumours compared with HercepTest (27.4% vs 9.2%).42 It is possible that the use of the HercepTest antibody may have affected the incidence of HLBC; however, the main aim of the study was to assess interobserver variability, which is unlikely to be dependent on the antibody clone.
In conclusion, this analytical validation study indicates that subspecialised breast pathologists have suboptimal agreement in evaluating HLBC. Although our proposed algorithm using the modified Ki-67 assessment methodology did not significantly improve interobserver variability among breast pathologists for IHC evaluation of HLBC, it may help improve interobserver agreement among general anatomic pathologists. There is an urgent need to develop a new assay or algorithm to reliably evaluate HLBC.