To evaluate the accuracy of a deep learning software (DLS) in the discrimination between phyllodes tumors (PT) and fibroadenomas (FA).
In this IRB-approved, retrospective, single-center study, we collected all ultrasound images of histologically secured PT (n = 11, 36 images) and a random control group with FA (n = 15, 50 images). The images were analyzed with a DLS designed for industrial grade image analysis, with 33 images withheld from training for validation purposes. The lesions were also interpreted by four radiologists. Diagnostic performance was assessed by the area under the receiver operating characteristic curve (AUC). Sensitivity, specificity, negative and positive predictive values were calculated at the optimal cut-off (Youden Index).
The DLS was able to differentiate between PT and FA with good diagnostic accuracy (AUC = 0.73) and high negative predictive value (NPV = 100%). Radiologists showed comparable accuracy (AUC 0.60–0.77) at lower NPV (64–80%). When performing the readout together with the DLS recommendation, the radiologists’ accuracy showed a non-significant tendency to improve (AUC 0.75–0.87, p = 0.07).
Deep learning based image analysis may be able to exclude PT with a high negative predictive value. Integration into the clinical workflow may enable radiologists to more confidently exclude PT, thereby reducing the number of unnecessary biopsies.
Phyllodes tumors (PT) of the breast are rare lesions, accounting for less than 1% of all breast tumors. They typically present in women aged 35 to 55 years and are mostly large, with a median size of 4 cm. Histologically, they are characterized by “leaf-like” lobulations, from which the name is derived (Greek phullon, leaf), with more abundant and cellular stroma than that of fibroadenoma (FA). PT are commonly classified as benign, borderline, or malignant on the basis of histological parameters such as mitotic count, cellular atypia, stromal cellularity and overgrowth, and the nature of the tumor borders. Histologically, benign PT can be mistaken for FA, whereas at the other end of the spectrum, malignant PT show overlapping features with primary breast sarcomas or spindle cell metaplastic carcinoma. However, regardless of their histology, all PT can recur, and the risk of local recurrence increases with larger size and malignancy.
FA are the most common benign tumors of the breast in women under 35 years of age. They present as well-defined, mobile masses that can increase in size and tenderness in response to high levels of estrogen (e.g. during pregnancy or prior to menstruation). Histologically, they are made up of both glandular breast tissue and stromal tissue. In contrast to PT, the risk of cancer is usually not increased in FA.
In addition to their histopathological similarities, FA are usually indistinguishable from PT on a macroscopic level. Both fibroepithelial tumors are often detected as fast-growing breast lumps, and distinguishing PT from FA by means of physical exam is extremely difficult. With increased public awareness and screening, most breast tumors are being discovered at earlier stages, when both tumors share a substantial overlap in sonographic features and size. Furthermore, sonography cannot distinguish between malignant, borderline and benign PT. Diagnostic evaluation is therefore often extended to the use of invasive diagnostic procedures, such as core-needle biopsies. However, even with the help of histology, diagnosis can be complicated due to sampling errors.
The diagnosis has wider implications that also influence the therapeutic approach to these tumors. Although conservative management is an acceptable strategy in FA, malignant PT should be completely enucleated with clear margins due to the high recurrence rate of up to 30%.
Traditionally, patients with breast masses that cannot clearly be identified as FA or PT will usually undergo complete surgical excision or mastectomy, for fear of overlooking a potentially malignant tumor. Therefore, accurate preoperative identification and differentiation of PT is critical to appropriate surgical planning, avoiding operative complications resulting from inadequate excision or surgical overtreatment. Most FA do not need surgical treatment at all. In these cases, biopsies are essentially an unnecessary physical, psychological and financial burden for the patient.
Deep learning is a type of machine learning that was inspired by the structure and function of the brain. It imitates the mammalian visual cortex in processing data using artificial neural networks (ANNs) that contain hidden layers. The deep learning software (DLS) learns to extract meaningful features from images to then make inferences and decisions on its own. “Meaningful” in this context stands for “helping to solve the problem at hand”, in our case discriminating FA from PT. This data-driven method has shown promising results in recent years, as opposed to older, more algorithmic approaches with hand-crafted features, which may often yield many arbitrary features not useful for the problem at hand. Hence, the use of deep learning in radiology as a method of differentiating and diagnosing tumors is a rapidly growing field. Although, as with any diagnostic test, false-positive results can occur, the sensitivity of deep learning, e.g. in mammography, has reached up to 84%, equaling or surpassing the diagnostic accuracy of seasoned specialists. Deep learning can be integrated into the assessment of sonographically detectable lesions and could be performed in the initial evaluation of indeterminate breast tumors (illustrated in Fig. 1).
In this retrospective, single-center study, we aimed to evaluate the accuracy of a DLS in the discrimination between PT and FA.
2. Materials and methods
2.1 Ultrasound examination and reference standard
This retrospective study was approved by the IRB, which waived the need for informed consent. All patients from a two-year period (July 2013 – July 2015) were reviewed for the presence of PT with histology as the reference standard (n = 11). From the remaining patients, a random subset with a histologically confirmed diagnosis of FA was taken (n = 15). Because of their low number (n = 4), FA with histopathological phyllodes features were not analyzed as a separate group; since the management of those lesions at our institution is surgical excision, they were counted as PT. Median lesion diameter (long axis) was 21.5 mm (interquartile range 18–26 mm) for FA and 26.0 mm (19–37 mm) for PT (p = 0.25). Lesion volume, calculated from all three diameters with the ellipsoid formula, was also not significantly different (13.6 vs. 24.3 cm³, p = 0.55). Mean age ± 95% confidence interval was 33.6 ± 15.2 years. All examinations were performed on the same type of ultrasound device (Logiq E9, GE Healthcare, Chicago, IL, USA) with the same reconstruction setting (“Breast”). For large lesions, multiple focus points were used. Functional ultrasound images (i.e. with Doppler or elastography overlay) were not consistently acquired and hence not included in the analysis. For lesions depicted in multiple images, all available images were used, resulting in a total of 50 images of FA and 36 images of PT. The raw DICOM images were converted into lossless, monochrome JPEG for further processing.
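For reference, the ellipsoid formula mentioned above is commonly written as follows. The symbols d_1, d_2 and d_3 (the three orthogonal lesion diameters) are illustrative and not taken from the study's own notation.

```latex
V = \frac{4}{3}\,\pi \cdot \frac{d_1}{2} \cdot \frac{d_2}{2} \cdot \frac{d_3}{2}
  = \frac{\pi}{6}\, d_1\, d_2\, d_3
```

With diameters in cm, V is obtained in cm³.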
2.2 Deep learning image analysis
Image analysis was performed with a DLS originally developed for industrial image analysis (ViDi Suite Version 2.0; Cognex Inc, Natick, MA, USA). This software takes advantage of the latest advances in deep learning algorithms to classify anomalies in images. It is currently used in various industries for real-time quality inspection, e.g. in defect detection of metal surfaces, traffic analysis or appearance-based product identification. It is currently not FDA approved but has recently shown promising results for detecting cancer in a dual-center mammography study. All computations were performed on a GeForce GTX 1080 graphics processing unit. In a first step, the images were cropped to the actual lesion by using the supervised ViDi Detection Tool, the architecture of which is optimized for anomaly localization in homogeneous patterns (i.e. subcutaneous fat). In a second step, the cropped lesions were analyzed using the ViDi Classification Tool, which in turn is optimized for image classification. A randomly chosen subset of images (n = 53) was used for the training of the software (training set), and the remaining images (n = 33) were withheld from the software and solely used to validate the resulting model after training (validation set). The split was performed on a per-patient basis.
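A per-patient split means that all images of one patient end up in the same set, so that no patient contributes to both training and validation. The following is a minimal Python sketch of that idea, not the procedure used inside the ViDi software. The function name `split_by_patient`, the `validation_fraction` parameter and the toy image and patient identifiers are hypothetical.

```python
import random
from collections import defaultdict


def split_by_patient(image_to_patient, validation_fraction=0.4, seed=0):
    """Split image IDs into (train, validation) lists so that all images
    of a given patient land in the same set."""
    by_patient = defaultdict(list)
    for image_id, patient_id in image_to_patient.items():
        by_patient[patient_id].append(image_id)

    # Shuffle patients (not images) with a fixed seed for reproducibility.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    # Hypothetical choice: hold out a fraction of patients, at least one.
    n_val = max(1, round(len(patients) * validation_fraction))
    val_patients = set(patients[:n_val])

    train, val = [], []
    for patient_id, images in by_patient.items():
        (val if patient_id in val_patients else train).extend(images)
    return sorted(train), sorted(val)


# Hypothetical toy data: 20 images spread over 6 patients.
toy = {f"img{i:02d}": f"patient{i % 6}" for i in range(20)}
train_ids, val_ids = split_by_patient(toy)
```

Because the split is made over patients, the number of images per set depends on how many images each selected patient contributes, which is one plausible reason an image-level split such as 53 versus 33 is not a round fraction.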
2.3 Radiologist’s readout
The cropped lesions were exported and presented to four radiologists in random order (INITIALS BLINDED FOR REVIEW: board-certified radiologists with 8, 3 and 2 years of experience in breast imaging, as well as a PGY-3 resident, referred to as readers 1–4, respectively). The readers were blinded to the study design as well as the clinical information of the patients. The images were rated on a 5-point Likert-like scale reflecting the confidence of the reader in his or her diagnosis (1 = definitely FA, 5 = definitely PT). After a four-month waiting period to avoid memory bias, the radiologists rated the lesions of the validation set again. This time, the DLS rating was shown below the image and the radiologists were asked to take it into consideration as well.
2.4 Statistical analysis
The statistical analysis was performed in R version 3.3.1 (R Foundation for Statistical Computing, Vienna, Austria). Continuous variables were expressed as median and interquartile range, categorical variables as counts or percentages. Interreader agreement was assessed pair-wise with the weighted Cohen’s Kappa and interpreted as follows: <0.20, poor; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, good; and 0.81–1.0, very good agreement.
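To illustrate how a weighted kappa on the 5-point rating scale can be computed and mapped to the agreement categories listed above, here is a minimal pure-Python sketch. The study's analysis was done in R, and the text does not state the weighting scheme, so the linear default here is an assumption. The function names `weighted_kappa` and `interpret_kappa` are hypothetical.

```python
import math


def weighted_kappa(ratings_a, ratings_b, categories=(1, 2, 3, 4, 5),
                   weights="linear"):
    """Weighted Cohen's kappa for two raters on an ordinal scale.

    weights: "linear" (assumed default) or "quadratic" disagreement weights.
    """
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("need two non-empty rating lists of equal length")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions and the marginal proportions per rater.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    num = sum(disagreement(i, j) * observed[i][j]
              for i in range(k) for j in range(k))
    den = sum(disagreement(i, j) * row[i] * col[j]
              for i in range(k) for j in range(k))
    if den == 0:
        return math.nan  # undefined: both raters used one identical category
    return 1.0 - num / den


def interpret_kappa(kappa):
    """Map kappa to the agreement categories used in the text."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"
```

For example, `interpret_kappa(0.47)` returns "moderate". Exact kappa values depend on the weighting scheme, so results from this sketch need not match those of the R analysis.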
Diagnostic performance was assessed with a receiver operating characteristic (ROC) analysis for the computer test and the human readers. Diagnostic accuracy was expressed as the area under the receiver operating characteristic curve (AUC) and compared with DeLong’s nonparametric test for paired data. Sensitivity, specificity, positive and negative predictive values were calculated at the optimal cut-off (Youden Index). A p-value <0.05 was considered indicative of significant differences. All tests were two-tailed.
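The ROC-based measures described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's R code: the AUC is computed with the rank-based (Mann-Whitney) formulation, a case is called PT when its score is at or above the cut-off, DeLong's test is not implemented, and the ratings and labels are invented toy data (1 = PT, 0 = FA).

```python
import math


def auc(scores, labels):
    """AUC as the probability that a random PT case scores higher than a
    random FA case (ties count one half). Assumes both classes are present."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def metrics_at(scores, labels, cutoff):
    """Sensitivity, specificity, PPV and NPV when score >= cutoff means PT."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= cutoff and y == 1)
    fn = sum(1 for s, y in pairs if s < cutoff and y == 1)
    fp = sum(1 for s, y in pairs if s >= cutoff and y == 0)
    tn = sum(1 for s, y in pairs if s < cutoff and y == 0)

    def ratio(a, b):
        return a / b if b else math.nan

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def youden_cutoff(scores, labels):
    """Cut-off that maximizes Youden's J = sensitivity + specificity - 1."""
    def j(cutoff):
        m = metrics_at(scores, labels, cutoff)
        return m["sensitivity"] + m["specificity"] - 1.0
    return max(sorted(set(scores)), key=j)


# Invented toy ratings on the 5-point scale (not study data).
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [1, 1, 2, 2, 2, 3, 4, 4, 5, 5]
best = youden_cutoff(scores, labels)
m = metrics_at(scores, labels, best)
```

The Youden Index weights sensitivity and specificity equally, so the cut-off it selects does not by itself favor a high NPV or a high PPV.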
3. Results
3.1 Deep learning image analysis
The DLS showed an excellent AUC of 0.89 on the whole data set and an AUC of 0.73 on the validation data, indicating some overfitting in our rather small data set. In order to demonstrate the generalizability to new cases, Fig. 2 depicts the performance on the validation data only. On both training and validation data the DLS exhibited a high specificity of 1.0 (summarized in Table 1).
Table 1. Diagnostic performance (area under the ROC curve + 95% CI) and sensitivity, specificity, positive and negative predictive value (NPV) of the DLS for all cases as well as the validation set separately, and each reader (all data).
Table 2 demonstrates that overall, interreader agreement was rated as fair (0.21–0.40) or moderate (0.41–0.60). Reader 1 showed the lowest interreader agreement (0.3, 0.21 and 0.21, respectively). Readers 2 and 3 exhibited the highest interreader agreement of 0.47 (moderate). The agreement of reader 4 with the other readers increased gradually (0.21 with reader 1, 0.31 with reader 2 and 0.36 with reader 3). Readers 2, 3, and 4 showed very similar AUCs (0.77, 0.75, and 0.74), while reader 1 showed the lowest AUC of 0.60. In the validation set (Fig. 2), reader 4 exhibited the highest specificity and reader 3 the highest sensitivity. The confusion matrices with the readers’ ratings vs. reference standard are summarized in Table 3.
Table 2. Pairwise interreader agreement measured by weighted Cohen’s Kappa (95% CI in brackets).
Diagnostic performance (AUC) was not significantly different between the DLS (validation data) and any of the readers (p-values 0.31, 0.80, 0.66, 0.87; example case shown in Fig. 3). However, at the optimal cut-off, the DLS was more sensitive than the readers, resulting in some false positives (example shown in Fig. 4) but consistently exhibited higher NPV and specificity (example case shown in Fig. 5).
In the second readout, three out of the four readers showed a non-significant tendency toward improved performance (p = 0.07), with reader 1 improving the most (from 0.61 to 0.77), and reader 2 slightly decreasing (from 0.77 to 0.75). Readers 3 and 4 moderately improved (0.75 to 0.87 and 0.74 to 0.84). In general, there was a higher gain in specificity than in sensitivity, as can be seen in the ROC curves in Fig. 6.
In this pilot study, we have investigated whether a DLS can extract meaningful features from ultrasound image data and learn to distinguish PT from FA. We found that the software may be able to exclude PT with a high negative predictive value. Furthermore, combining the DLS estimate with the radiologist’s impression showed a tendency toward better diagnostic performance, although this did not reach statistical significance (p = 0.07).
The most widespread diagnostic and screening management of breast masses includes physical examination, radiographic assessment (ultrasonography or mammography), and, if indicated, tissue specimen analysis (fine-needle aspiration or core needle biopsy). However, these diagnostic tests often fall short in differentiating PT from FA. Our results reflect the current controversy among radiologists in diagnosing PT and FA based on ultrasound images, evident from the only fair to moderate interreader agreement in Table 2. Although MRI findings can be used to help determine the histological grade of known breast PT, MRI findings have been reported to be insufficient for reliable differentiation between FA and PT. In this study, we show that deep learning image analysis can use ultrasound images to discriminate PT from FA with a specificity and negative predictive value that surpass those of experienced radiologists. Furthermore, the software reached an AUC of 0.73 in the validation set, with the readers reaching comparable performance. Interestingly, the diagnostic performance of the radiologists did not correlate with their years of experience, illustrating the ambiguity and lack of distinctive characteristics for either tumor. Furthermore, this could be due to the fact that the incidence of PT is very low and the radiologists had not been exposed to a high absolute number of cases despite years of experience (Table 1). Interreader agreement decreases for each reader, which may be explained by the gradually decreasing level of experience between readers (8, 3 and 2 years, and PGY-3 resident, respectively). Therefore, it is not surprising that readers 1 and 4, the most experienced and the least experienced reader, show the lowest interreader agreement, and that readers 2 and 3, with only one year of difference in training, demonstrate the most similar results.
When compared to the DLS, the results show that the radiologists achieve higher readout specificity and thus positive predictive value (the ability to correctly identify PT), whereas deep learning image analysis showed the strongest performance in its negative predictive value (the ability to correctly exclude patients without PT). Accordingly, augmenting the reader’s impression with the DLS estimate showed a tendency toward increased diagnostic accuracy, although this did not reach statistical significance (p = 0.07). This tendency suggests that supplementing the diagnostic workup with deep learning image analysis may enhance the accuracy in differentiating PT from FA. DLS is already integrated into routine diagnostics with a level of competence comparable to radiologists.
One of the limitations of our study design was that we only trained the software to distinguish between two different types of breast masses, PT and FA. This means that it cannot detect other lesions, such as invasive cancers or scars, that may be important differential diagnoses. Furthermore, the software in its current form showed a high specificity and negative predictive value, meaning that it would correctly identify unaffected patients but not reliably identify patients who need treatment. This shortcoming seemed to be offset to a certain degree by using the software as a supplement to the radiologist’s decision. Therefore, the software in its current form would mostly be suitable as an adjunct tool to supplement a radiologist’s diagnosis. Future studies and refinements of the software might allow deep learning to act as a screening tool for all types of breast lesions.
Further limitations of our study are the small sample size, the retrospective design as well as the restricted experimental setting. In the clinical routine, FA are far more common than PT. Since the software was trained on a cohort with a high prevalence of PT, it would possibly overestimate the occurrence of PT in the clinical routine. However, the high NPV should theoretically be maintained or even increase at a lower prevalence of PT.
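The dependence of predictive values on prevalence follows from Bayes' theorem and can be checked with a short Python sketch. The sensitivity and specificity below are hypothetical placeholders rather than the study's results. The study prevalence is taken as 11 of 26 patients, and the routine prevalence of 5% is an assumed value for illustration only.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence
    fn = (1.0 - sensitivity) * prevalence
    tn = specificity * (1.0 - prevalence)
    fp = (1.0 - specificity) * (1.0 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical test characteristics (not the study's results).
SENSITIVITY = 0.9
SPECIFICITY = 0.7

# Study cohort: 11 PT among 26 patients; routine prevalence is assumed.
ppv_study, npv_study = predictive_values(SENSITIVITY, SPECIFICITY, 11 / 26)
ppv_routine, npv_routine = predictive_values(SENSITIVITY, SPECIFICITY, 0.05)
```

With fixed sensitivity and specificity, lowering the prevalence raises the NPV and lowers the PPV, which is consistent with the expectation stated above.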
DLS is a novel method that has not yet been approved by the FDA or any other regulatory body. Furthermore, the cost-effectiveness of a DLS implementation has not yet been examined. These are some of the many questions that must be addressed before its broader use.
In conclusion, computer-assisted diagnosis in the form of deep learning image analysis is a useful tool to differentiate patients with PT and FA. A decision by the examining radiologist supplemented by the aid of DLS provides the highest diagnostic performance, and its integration into clinical routine may enable doctors to more confidently exclude PT, resulting in fewer unnecessary biopsies.
Conflicts of interest
The authors of this manuscript declare no relevant conflicts of interest and no relationships with any companies whose products or services may be related to the subject matter of the article.