WCAY object detection of fractures for X-ray images of multiple sites | Scientific Reports

Nov 07, 2024

Scientific Reports volume 14, Article number: 26702 (2024)

The WCAY (weighted channel attention YOLO) model, which is designed to identify fracture features across diverse X-ray image sites, is presented herein. This model integrates novel core operators and an innovative attention mechanism to enhance its efficacy. Initially, leveraging the benefits of dynamic snake convolution (DSConv), which is adept at capturing elongated tubular structural features, we introduce the DSC-C2f module to augment the model’s fracture detection performance by replacing a portion of C2f. Subsequently, we integrate the newly proposed weighted channel attention (WCA) mechanism into the architecture to bolster feature fusion and improve fracture detection across various sites. Comparative experiments were conducted to evaluate the performance of several attention mechanisms. These enhancement strategies were validated through experimentation on public X-ray image datasets (FracAtlas and GRAZPEDWRI-DX). Multiple experimental comparisons substantiated the model’s efficacy, demonstrating its superior accuracy and real-time detection capabilities. According to the experimental findings, on the FracAtlas dataset, our WCAY model exhibits a notable 8.8% improvement in mean average precision (mAP) over the original model. On the GRAZPEDWRI-DX dataset, the mAP reaches 64.4%, with a detection accuracy of 93.9% for the “fracture” category alone. The proposed model substantially improves on the original algorithm and compares favorably with other state-of-the-art object detection models. The code is publicly available at https://github.com/cccp421/Fracture-Detection-WCAY .

Bone trauma, arising from incidents such as jostling, falls, and car accidents, is a prevalent occurrence in modern life. It encompasses a range of injuries, including fractures, cracks, tears, and compression injuries. Symptoms typically manifest as pain, swelling, and restricted movement, potentially leading to complications such as nonunion and infection1. Timely diagnosis and appropriate treatment are crucial in managing bone trauma, given the unpredictable nature of injury occurrence and variations in medical expertise among treating physicians2. The advent of artificial intelligence offers promising solutions to the clinical complexities associated with orthopedic trauma3.

Deep learning, a pivotal subset of artificial intelligence, has garnered significant attention for its applications in fracture detection and as a supplementary tool for clinician diagnostics4. Fracture detection primarily utilizes X-ray and computed tomography (CT) images, with X-ray image research being particularly prevalent5. Consequently, fracture detection within deep learning frameworks can be conceptualized as an object detection task6.

Object detection algorithms serve the purpose of identifying both the location and class of targets within an image7. These algorithms predominantly rely on convolutional neural networks and are categorized into two main types: two-stage models and single-stage models8. Two-stage models typically involve the generation of candidate regions from the input image, followed by classification and regression9. Examples include R-CNN10, Fast R-CNN11, and Faster R-CNN12, which are known for their higher detection accuracy. In contrast, single-stage models simplify the problem by treating object detection as a regression task and performing global regression-based classification13. Models such as the You Only Look Once (YOLO)14 series and RetinaNet15 directly extract class and location information without the need for candidate region generation.

Furthermore, improving the detection performance of neural network models for object detection remains a prominent research focus. Enhancement strategies primarily revolve around data augmentation and network architecture modifications16. Of particular interest in network structure enhancement is the integration of attention mechanisms, which is a current area of active exploration and research16. The attention mechanism is a unique structure embedded in machine learning models that automatically captures the contribution of input data to output data17. The basic principle of attention mechanisms in computer vision is to find the correlation between the raw data and then emphasize some key features18. Examples include the squeeze-and-excitation (SE) attention method19, convolutional block attention module (CBAM)20, global attention mechanism (GAM)21, and coordinate attention (CA)22.

Therefore, we use the FracAtlas X-ray dataset23, which is a collection of X-ray scan image data from multiple parts of the hand, shoulder, leg, and foot. A generalized X-ray fracture detection model is designed. We introduce the dynamic snake convolution C2f (DSC-C2f) operator, which is designed to efficiently extract slender fracture features. In addition, we introduce a novel WCA attention mechanism to improve the detection accuracy. Leveraging insights from the YOLO family of single-stage detection algorithms, we develop the WCAY fracture detection model. To improve the overall efficacy of the proposed model, we incorporated the YOLO algorithm for different model sizes, including the Nano, Small and Medium models. To validate the feasibility of the model, we trained the model on the GRAZPEDWRI-DX public dataset24. The contributions of this paper are summarized as follows:

Leveraging dynamic snake convolution (DSConv)25, we introduce a learning residual module, DSC-C2f, capable of capturing tubular structures.

We propose a weighted channel attention mechanism (WCA).

We propose a new object detection network called weighted channel attention YOLO (WCAY) that incorporates some of the above attention mechanisms as well as the WCA and DSC-C2f proposed in this paper.

The feasibility of DSC-C2f, WCA and WCAY was verified with several datasets.

Fracture detection, a critical aspect of medical imaging, has seen widespread application. Guan et al.26 utilized the R-CNN model on the MURA dataset27, achieving an average accuracy of 62.04%. Yahalomi et al.28 demonstrated the effectiveness of a Faster R-CNN model in localizing distal radius fractures, surpassing radiologists’ performance and offering promise in rare disease identification. Wang et al.29 introduced ParallelNet, an R-CNN network with a TripleNet backbone, for thigh fracture detection in a dataset comprising 3842 X-ray images. Similarly, Krogue et al.30 employed a RetinaNet model utilizing DenseNet169 for automatic detection, localization, and classification of hip fractures.

While these two-stage algorithms boast high accuracy, their speed remains a concern. Achieving a balance between accuracy and speed is imperative. Single-stage object detection algorithms, exemplified by the YOLO family, have emerged as significant contributors in this realm. Li et al.31 applied the YOLOv3 model to vertebral fracture detection, demonstrating its effectiveness. Yuan et al.32 integrated external attention and 3D feature fusion into YOLOv5 to detect skull fractures in CT images. Warin et al.33 leveraged YOLOv5 to detect maxillofacial fractures in a substantial dataset, classifying cases into frontal, midfacial, and jaw fractures and no fractures. Mushtaq et al.34 demonstrated the proficiency of the YOLOv5 model in lumbar vertebrae localization, achieving an average accuracy of 0.975. Furthermore, in pediatric wrist fracture detection, Dibo et al.35 enhanced YOLOv7 with the CBAM attention mechanism, achieving improved performance on the GRAZPEDWRI-DX dataset. Moreover, Rui et al.36 utilized the YOLOv8 model for wrist fracture detection, presenting an application tailored for this purpose.

However, due to the difficulties in establishing a high-quality fracture image dataset and the subjective nature of doctors’ image annotations, a completely uniform standard does not exist, and deep learning-based fracture diagnosis studies are usually conducted for specific fracture types37. Therefore, it is particularly important to develop a deep learning model for fracture detection that is applicable to various types of images and different fracture sites.

Redmon et al.14 introduced the YOLO architecture in 2015 for real-time detection, aiming to address target detection as a regression challenge. This approach involves directly mapping coordinates and class probabilities from image pixels to bounding boxes using a single neural network model. YOLOv838, the latest iteration proposed by Glenn Jocher, represents a significant improvement over YOLOv539. Notably, YOLOv8 replaces the C3 module with the more efficient C2f module, which features a CSP bottleneck with two convolutions instead of three, along with adjustments to the number of channels. Moreover, the head section undergoes modifications to employ the decoupled head technique, separating classification and detection tasks.

To address issues such as inaccurate fracture detection, excessive model parameters, large model sizes, and limited detection sites in traditional networks, this study introduces a novel X-ray fracture detection model named WCAY (shown in Fig. 1). Leveraging YOLOv8s as the baseline network, we incorporate the DSC-C2f core operator into the network backbone to enhance the model’s sensitivity to elongated and curved tubular structures typical of fractures. Additionally, we integrate a self-developed attention module (WCA) into the neck network to enable the model to prioritize abnormal regions while suppressing non-anomalous areas, thereby enhancing overall performance.

Model structure of WCAY.

The YOLOv8s network architecture incorporates numerous C2f modules, which are primarily tasked with learning residual features. Therefore, the network’s performance is heavily reliant on the effectiveness of these C2f module features. Given the significant variations in fracture morphology, location, and size—particularly with crack-like fractures, which exhibit diverse shapes and sizes—the original C2f module may struggle to adequately extract such small, localized features. To address this limitation and further bolster the network’s ability to learn fracture features, this paper introduces the DSConv from the dynamic snake convolution network (DSCNet). Subsequently, a new module, termed the DSC-C2f module, is meticulously designed.

In 2023, Yaolei Qi et al.25 developed DSCNet, which is specifically tailored for tubular structure segmentation tasks. Within DSCNet, DSConv serves as a convolutional module that offers an alternative to traditional convolution. As illustrated in Fig. 2, DSConv demonstrates distinctive operational characteristics. To effectively extract local features of tubular structures and enable the convolutional kernel to focus on intricate geometric features, DSConv introduces deformation offsets. By selecting each position to be processed in sequence, DSConv keeps its attention continuous. This also prevents large deformation offsets from spreading the receptive field too far, resulting in an output feature map resembling a “snake” shape.

Schematic of how DSConv works. Dynamic snake convolution (DSConv) learns deformations based on input feature maps and adaptively focuses on elongated and tortuous local features based on an understanding of the morphology of tubular structures25.

Figure 3 illustrates the structure of DSC-C2f. The DySnakeConv module is formed by linking two initial DSConv layers with a convolution module (ConvM) layer. Initially, a ConvM layer increases the number of channels in the expansion layer. Subsequently, the DySnakeConv module is applied to the feature map, followed by a second ConvM layer that reduces the number of channels in the output feature map to match the input channels. Finally, the feature obtained in the preceding stage is merged with the residual edge for feature fusion, thus constituting the dynamic snake convolution bottleneck (DSC-Botneck) module. The newly designed DSC-C2f module is obtained by replacing all the bottleneck components of the original C2f module with DSC-Botneck modules. This DSC-C2f module combines the multiscale feature extraction capabilities of the original C2f module with DSConv’s ability to pay adaptive attention to slender and curvilinear features.

Structure of DSC-C2f.

The attention mechanism plays a crucial role in capturing the regions of interest in the whole image, which further enhances the model’s focus on the image features of the abnormal bone region and improves model generalizability. However, the attention mechanism also has the disadvantage of increased computational cost. We design a new channel attention mechanism, weighted channel attention (WCA), inspired by the coordinate attention (CA)22 module, as shown in Fig. 4.

Principle of the WCA. Here, “X avg pool” represents 1D horizontal global pooling, and “Y avg pool” indicates 1D vertical global pooling22.

This WCA module can be viewed as a computational unit designed to improve the representation of features learned by the network. It can take as input any intermediate feature tensor \(\:X\in\:{R}^{C\times\:H\times\:W}\), where \(\:C\) denotes the number of input channels and \(\:H\) and \(\:\:W\) denote the spatial dimensions of the input features. To clearly describe the proposed WCA, we first revisit the embedding of location information into the channel attention CA, as shown in (a) in Fig. 5.

The CA decomposes the original input tensor \(\:X\) into two parallel one-dimensional feature encoding vectors for modeling cross-channel dependencies with spatial location information. Each vector comes from a one-dimensional global average pooling along one spatial direction: pooling along the horizontal direction yields a collection of positional information along the vertical dimension, and vice versa. The one-dimensional global average pooling that encodes global information along the horizontal dimension for channel \(\:C\) at height \(\:H\) can be expressed as Eq. (1). Similarly, the output of the pooling for channel \(\:C\) at width \(\:W\) can be expressed as Eq. (2).
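For reference, the coordinate attention formulation22 that this description follows defines the two pooled outputs as shown below; these are presumably the expressions referred to as Eq. (1) and Eq. (2). The symbols \(z^{h}\) and \(z^{w}\) are assumed names for the pooled vectors, and lowercase \(c\), \(h\), and \(w\) index a single channel, row, and column.

```latex
\begin{align}
z_{c}^{h}(h) &= \frac{1}{W} \sum_{0 \le i < W} x_{c}(h, i) \tag{1} \\
z_{c}^{w}(w) &= \frac{1}{H} \sum_{0 \le j < H} x_{c}(j, w) \tag{2}
\end{align}
```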

Here, \(\:{x}_{c}\) denotes the input feature in channel \(\:c\). Through such an encoding process, CA captures the long-distance dependencies in the horizontal dimension direction and preserves the exact position information in the vertical dimension direction. The model uses input feature encoding to synthesize global information to help capture spatial global features. It then generates two parallel 1D vectors for feature coding and permutes the shape of one of the vectors before merging the two. Immediately after, these parallel encoded vectors are shared with the downscaled 1 × 1 convolution. Coordinate attention (CA) then decomposes the 1 × 1 convolution output into dual parallel 1D feature encoding vectors. Each path contains a 1 × 1 convolution and a nonlinear sigmoid function. Finally, the attentional weights of the two paths are applied to the original feature map to produce the final output. This approach preserves accurate spatial details while efficiently exploiting long-range dependencies through interchannel and spatial information coding.

Although CA embeds precise positional information into channels and its capture of long-distance spatial interactions improves the model’s concentration on fracture features40, the excess of long-range information causes the model to miss crucial feature details during multiscale fusion, leading to overfitting. As a result, the fracture feature localization becomes diffuse and unconstrained, with the model capturing a wide range of focal points beyond the pre-labeled bounding box in the image. To solve this problem of concentration diffusion, we designed the WCA module, whose overall structure is shown in (b) in Fig. 5.

Comparisons with different attention modules: (a) CA module; (b) WCA module.

Specifically, given the aggregated feature maps produced by Eq. (1) and Eq. (2), we first concatenate them and send them to a 3 × 3 convolutional transform function \(\:{F}_{3\times\:3}\) to obtain the following formula:

where \(\:\left[\bullet\:,\:\bullet\:\right]\) denotes the join operation along the spatial dimension and \(\:f\in\:{R}^{C\times\:1\times\:(W+H)}\) is the intermediate feature map encoding spatial information in the horizontal and vertical directions. We then split \(\:f\) into two separate tensors \(\:{f}^{H}\in\:{R}^{C\times\:H\times\:1}\) and \(\:{f}^{W}\in\:{R}^{C\times\:1\times\:W}\) along the spatial dimension. Then to obtain the feature weights for each of the two tensors in the vertical and horizontal dimensions, we feed \(\:f\) into a 1 × 1 convolutional transform to obtain the following

where \(\:\sigma\:\) is a sigmoid function, and similarly, we split \(\:w\) along the spatial dimensions into two separate feature weights \(\:{w}^{H}\in\:{R}^{C\times\:H\times\:1}\) and \(\:{w}^{W}\in\:{R}^{C\times\:1\times\:W}\). We then aggregate the dimension tensors and weights via simple multiplication to obtain Eq. (5) and Eq. (6)

Finally, by multiplying the output of the two parallel routes with the original input feature map, the output Y of our WCA module can be written as
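The steps above can be sketched in NumPy as follows. This is an illustrative reading of the prose, not the authors' implementation: the function names, the random kernels, and the concatenation order (height first) are assumptions, and any normalization or extra activation layers in the released code are omitted. The 3 × 3 convolution is reduced to a length-3 convolution along the concatenated axis, which is equivalent when the pooled map has height 1 and zero padding is used.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def conv1d_same(seq, kernel):
    """Length-preserving conv. seq: (C_in, L); kernel: (C_out, C_in, 3)."""
    length = seq.shape[1]
    padded = np.pad(seq, ((0, 0), (1, 1)))  # zero padding on both ends
    out = np.zeros((kernel.shape[0], length))
    for k in range(3):
        out += kernel[:, :, k] @ padded[:, k:k + length]
    return out


def wca_forward(x, k3, k1):
    """Hypothetical WCA forward pass for x of shape (C, H, W)."""
    c, h, w = x.shape
    z_h = x.mean(axis=2)                      # (C, H): pooled along the width
    z_w = x.mean(axis=1)                      # (C, W): pooled along the height
    f = conv1d_same(np.concatenate([z_h, z_w], axis=1), k3)  # (C, H + W)
    wts = sigmoid(k1 @ f)                     # 1x1 conv + sigmoid, (C, H + W)
    f_h, f_w = f[:, :h], f[:, h:]             # split features
    w_h, w_w = wts[:, :h], wts[:, h:]         # split weights
    g_h = (f_h * w_h)[:, :, None]             # weighted vertical branch (C, H, 1)
    g_w = (f_w * w_w)[:, None, :]             # weighted horizontal branch (C, 1, W)
    return x * g_h * g_w                      # re-weight the input feature map


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 5))
k3 = rng.standard_normal((4, 4, 3)) * 0.1     # stands in for F_3x3
k1 = rng.standard_normal((4, 4)) * 0.1        # stands in for the 1x1 conv
y = wca_forward(x, k3, k1)
print(y.shape)
```

The output keeps the shape of the input, so a module of this kind can be placed between existing layers of the neck without changing the surrounding tensor sizes.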

In contrast to channel attention, which solely recalibrates the significance of various channels, our WCA block not only incorporates spatial information encoding but also amplifies constraints, prioritizing spatial details. As elucidated earlier, weighted attention is concurrently applied along both the horizontal and vertical directions to the input tensor. Each element within these attention maps signifies the presence of the object of interest in the corresponding row and column. This encoding mechanism enables our WCA to precisely pinpoint the exact position of an object, thereby facilitating improved recognition by the overall model.

The FracAtlas and GRAZPEDWRI-DX datasets were used in this study. The FracAtlas dataset is composed of 4083 X-ray images covering all major parts of the human body, collected from three major hospitals in Bangladesh, as shown in Fig. 6. This dataset was manually annotated with the help of two radiologists and an orthopedic surgeon and contains 717 images with 922 fracture instances23. The GRAZPEDWRI-DX dataset, shown in Fig. 7, was collected by a number of pediatric radiologists at the Department of Pediatric Surgery at the University Hospital Graz. It comprises 10,643 wrist studies with 20,327 image samples from 6,091 unique pediatric patients24. The dataset was annotated by a group of pediatric radiologists. There are nine different types of annotation objects, and each image can be associated with multiple objects35–36.

FracAtlas dataset, showing scans containing various parts of the arm, leg, waist and shoulder. Each fracture instance has its own mask and bounding box, and the scans also have a global label for the classification task, which is set to “fractured”.

The GRAZPEDWRI-DX dataset, showing wrist fracture conditions in children. Because there are fewer images in the metal category, we merged the metal category into the foreign body category to help training converge on the dataset. The dataset categories are classified as “fracture”, “text”, “periostealreaction”, “pronatorsign”, “softtissue”, “foreignbody”, “boneanomaly”, and “bonelesion”.

In addition, the restricted image diversity observed in low-feature X-ray images poses a challenge, as models trained solely on such data may exhibit suboptimal performance when applied to other X-ray images. To enhance the robustness of these models, we employ data augmentation techniques. Specifically, we implement online data augmentation on the training dataset, leveraging methods such as mosaic and mixup. Additionally, we adjust image brightness and contrast using Albumentations41, an open-source Python library for image augmentation.
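As a rough illustration of two of these operations, the sketch below applies a linear brightness and contrast adjustment and a mixup blend to short lists of 8-bit pixel values. It is plain Python and does not call Albumentations; the function names and the parameters `alpha`, `beta`, and `lam` are hypothetical, and in a detection setting mixup would also merge the bounding-box labels of both images.

```python
def adjust_brightness_contrast(pixels, alpha=1.0, beta=0.0):
    """Scale contrast by alpha and shift brightness by beta on 8-bit values."""
    return [min(255, max(0, round(alpha * p + beta))) for p in pixels]


def mixup(pixels_a, pixels_b, lam):
    """Blend two equally sized images; lam is the weight of the first image."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(pixels_a, pixels_b)]


row = [0, 100, 200, 250]
brighter = adjust_brightness_contrast(row, alpha=1.2, beta=10)
blended = mixup([0, 100], [200, 100], lam=0.25)
print(brighter, blended)
```

Values that would exceed the 8-bit range are clipped to 255, which is why the last pixel saturates.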

This research does not involve human participants and/or animals. All methods complied with the guidelines and relevant regulations.

This experiment was conducted on an Ubuntu 18.04 system equipped with an Intel(R) Xeon(R) Platinum 8255C CPU and an NVIDIA GeForce RTX 3090 GPU, using PyTorch 1.11. During training, the input image resolution was set to 640 × 640 pixels. The model was trained for 300 epochs with a patience of 50, a batch size of 32, and a learning rate of 0.01 using the SGD optimizer. Each dataset was randomly divided into training, validation, and test sets comprising approximately 70%, 20%, and 10% of the original dataset, respectively.
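A minimal sketch of this setup is given below. The hyperparameter values come from the text; the key names follow the Ultralytics training arguments, which is an assumption about the tooling, and `split_dataset` with its fixed seed is a hypothetical helper rather than the authors' script.

```python
import random

# Hyperparameters reported in the text. The key names mirror Ultralytics
# training arguments, which is an assumption about the tooling used.
TRAIN_ARGS = {"imgsz": 640, "epochs": 300, "patience": 50,
              "batch": 32, "lr0": 0.01, "optimizer": "SGD"}


def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle items and cut them into train/validation/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * ratios[0])
    n_val = round(len(items) * ratios[1])
    # The test set takes whatever remains, so no item is dropped.
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


train_ids, val_ids, test_ids = split_dataset(range(1000))
print(len(train_ids), len(val_ids), len(test_ids))
```

Fixing the seed keeps the split reproducible across runs, which matters when several model variants are compared on the same test set.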

The key evaluation metrics of object detection algorithms include detection accuracy, model complexity, and detection speed. We introduce the key metrics of precision, recall and mAP to evaluate the model detection accuracy. The precision and recall are calculated via Eq. (8) and Eq. (9)
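Written out, the standard definitions of these two metrics are as follows, using the counts TP, FP, and FN defined in the next paragraph. The numbering (8) and (9) is assumed to match the references in the text.

```latex
\begin{align}
\mathrm{Precision} &= \frac{TP}{TP + FP} \tag{8} \\
\mathrm{Recall} &= \frac{TP}{TP + FN} \tag{9}
\end{align}
```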

In the evaluation of object detection algorithms, true positives (TP) represent correctly detected positive samples, false positives (FP) represent negative samples incorrectly identified as positive, and false negatives (FN) represent positive samples erroneously identified as negative. A precision-recall (P-R) curve is generated for each category during the performance assessment, plotting precision against recall42. The area between this curve and the horizontal axis is the average precision (AP) of the category. The mAP value of the model is computed as the average of the AP values across all categories. Typically, mAP is reported in two forms: mAP50, which counts a prediction as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5, and mAP50:95, which averages the results over IoU thresholds ranging from 0.5 to 0.95.
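The sketch below shows how these quantities can be computed for axis-aligned boxes. It uses all-point interpolation of the P-R curve, which is one common convention and not necessarily the one used in the experiments; the function names and the example AP values are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def average_precision(recalls, precisions):
    """Area under the P-R curve; recalls must be sorted in ascending order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
ap = average_precision([0.5, 1.0], [1.0, 0.5])
per_class_ap = [ap, 0.25]                 # illustrative values only
mean_ap = sum(per_class_ap) / len(per_class_ap)
print(round(overlap, 4), ap, mean_ap)
```

With an IoU of about 0.14, the two example boxes would not count as a match at the 0.5 threshold used for mAP50.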

The complexity of an object detection algorithm is gauged by factors such as model size, parameter count, and computational demands; higher values indicate a more complex model. This study assesses model complexity through computational load and model size. The computational load, which is indicative of time complexity, is quantified in floating-point operations (FLOPs), where one GFLOPs equals one billion floating-point operations. This operation count is distinct from FLOPS, which measures operations per second. Higher values signify greater computational resource requirements.
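As a worked example of how such a count arises, the sketch below estimates the FLOPs of a single convolution layer. It assumes the common convention of counting each multiply-accumulate as two operations and ignores the bias term; the layer dimensions are illustrative and not taken from the WCAY architecture.

```python
def conv2d_flops(c_in, c_out, kernel, h_out, w_out):
    """FLOPs of one convolution layer, counting a multiply-add as 2 ops."""
    macs = kernel * kernel * c_in * c_out * h_out * w_out
    return 2 * macs


# A 3x3 convolution from 64 to 128 channels on an 80x80 output map.
flops = conv2d_flops(c_in=64, c_out=128, kernel=3, h_out=80, w_out=80)
gflops = flops / 1e9
print(flops, round(gflops, 3))
```

Summing this quantity over all layers gives the model-level GFLOPs figure reported in comparisons of this kind.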

To demonstrate the effectiveness of WCAY, we chose YOLOv8s as the baseline network (Baseline) and added the DSC-C2f module to the backbone network and our WCA attention mechanism to the neck network. We performed ablation studies mainly on the FracAtlas dataset, testing different combinations of the improved modules.

Table 1. Ablation experiment.

To demonstrate the effectiveness of DSC-C2f in the detection task and the effect of DSC-C2f at different positions in the network on the detection performance, we uniformly conducted a series of positional substitution comparison experiments on DSC-C2f on the FracAtlas dataset.

As seen in Fig. 1, a C2f layer is set up in the P2, P3, P4 and P5 layers of the original network backbone to extract features from the input image, and we replace the C2f of each layer with the DSC-C2f module in turn. As shown in Table 2, the accuracy of the model is improved to different degrees after replacing the C2f layer in the original model with DSC-C2f, which reflects the ability of the DSC-C2f module to extract tubular fracture features. In addition, different arrangements of the same number of modules produce different results. When we replace the C2f module in the P5 layer with DSC-C2f, the model detection accuracy improves the most: mAP50 rises by 5.7%, from 47.9% for the baseline model to 53.6%, a larger gain than replacement at positions P2, P3, or P4. Although the number of parameters increases, this placement gives the largest improvement in accuracy.

Figure 8 illustrates the impact of DSC-C2f on model accuracy across various locations. Over time, the precision and recall curves consistently surpass the baseline curve. Notably, the most effective strategy, yielding the maximum mAP, is to replace the C2f module at layer P5 with DSC-C2f. This approach ensures that the model maintains its initial precision and recall levels while enhancing accuracy, thereby influencing the mAP positively.

Comparison of the precision and recall when the DSC-C2f module is at different positions in the network structure.

Figure 9 compares visualizations of the effective receptive field (erf)43–44 for each C2f module in the network backbone. As shown in the figure, we compare the erf sizes of the original C2f modules in each layer of the network backbone with those of our DSC-C2f modules; for the replaced DSC-C2f modules, the erf is smaller than in the baseline network. Generally, the smaller the receptive field is, the more local and detailed the features tend to be. Consequently, our DSC-C2f module excels in capturing local features of the input image, enhancing the network’s ability to discern local patterns and structures.

Comparison of effective receptive fields (erf). Visual comparison of the effective receptive field of the DSC-C2f module and the C2f module.

In this section, we conduct comparative experiments on different attention mechanisms embedded in network models to further validate the effectiveness of the proposed WCA module.

We chose YOLOv8s as the benchmark model to compare how the WCA module performs with X/Y weights added. The experimental results are shown in Table 3. Because our WCA module was inspired by the CA module, the two perform almost identically when no weights are added. The performance of the model improves significantly when weights in the horizontal (X) and vertical (Y) directions are added separately20, as well as when both are added. The visualization results are shown in Fig. 10. With the addition of the X weights alone, the model is more sensitive to the horizontal direction, and the activation values of the heat map are significantly higher, indicating that the region receives more attention in the x-direction. Similarly, with the addition of the Y weights, the model has higher heat-map activation values along the vertical direction.

Comparison of heat map results for WCA with the addition of different directional weights. The heatmaps were created by Grad-CAM45. With the addition of horizontal (X) and vertical (Y) direction weights, the model with WCA pays more attention to fracture features.

Meanwhile, we select different attention mechanisms to compare with WCA, and further verify its effectiveness by adding SE19, CBAM20, GAM21, and CA22. The experimental results are presented in Table 4. After integrating GAM and CBAM, the parameter count of the model grows sharply without a satisfactory improvement in detection accuracy. On the other hand, SE and CA achieve significant accuracy improvement with minimal parameter increment. However, their efficacy in capturing fracture features seems to be somewhat limited, as shown in the heat map in Fig. 11. SE occasionally fails to capture certain features, while CA, due to its intrinsic properties, sometimes exceeds the specified concentration range. In contrast, despite an increase in parameters and a negligible increase in computational cost, WCA improves accuracy by 5.4% compared to the baseline network without the attention mechanism.

Results of our heatmap visualization of different attentional mechanisms on the FracAtlas and GRAZPEDWRI-DX datasets. It is clear that our WCA can localize objects of interest more accurately than other attention methods.

In addition, we also conducted comparative experiments for different attention mechanisms in the benchmark network after adding the DSC-C2f module, as shown in Table 5. The experiments show that after feature extraction with the DSC-C2f module, except for the model with the addition of the WCA module, which has a 3.1% improvement in mAP, the mAP values of the models with the addition of the other modules decreased. In terms of the precision and recall metrics, except for the model with the addition of the GAM module, the metrics of all the other models improved. In contrast, the WCA attentional mechanism, which outperforms other attentional mechanisms in all metrics, is more likely to perform well in fracture detection tasks.

To demonstrate the effectiveness of the proposed WCAY algorithm for fracture detection in X-ray images, we conducted a series of comparative experiments. We have selected several state-of-the-art object detection methods for our experiments, including the YOLO series, the DETR46 series, and other single-stage detection models47,48, of which we have set up different sizes such as nano, small, and medium for the YOLO series.

It is worth noting that the training parameters of the DETR series models differ from those of the YOLO series: we use the official default parameters and pre-trained weights, with the batch size set to 8 and an input size of (974, 800).

As seen from the results in Table 6, our algorithm has a positive effect on the detection performance, with the highest mAP at each model size. For the nano size, the mAP of WCAY-n reaches 47.2%, which is 4.9% higher than the 42.3% of YOLOv8-n, the highest mAP among the other models. Our model also achieves the best results in comparisons with models of 30 M parameters or more, and has a higher mAP than the DETR series, whose parameter counts and computational effort are more than 33% higher. Our model also achieves the best performance among the single-stage detection models. Notably, among the small models, our WCAY achieves the highest mAP of all models, 56.7%, which is 5.9%, 5.7%, and 7.2% higher than YOLOv8, RT-DETR49, and FreeAnchor48, respectively, the models with the highest mAP values in the other algorithm series.

It is essential to highlight that transitioning from small to medium in size leads to a decrease in the mAP. This decline can be attributed to the larger model size necessitating higher-resolution input images and larger datasets. However, given the standardized input image size of 640, medium-sized models and larger models are susceptible to overfitting on our dataset. This is crucial for the DETR family of models, which is why pre-training weights need to be added during the training process. Consequently, it becomes evident that the small model size is the most suitable for our detection task.

To validate the versatility of our model, we conducted comparative experiments across multiple categories using the GRAZPEDWRI-DX dataset. The results, depicted in Figs. 12 and 13, reveal WCAY’s superior mAP across various real-time detection algorithms. However, our algorithms exhibit slightly lower accuracy in detecting the “bone anomaly” and “soft tissue” categories. Nonetheless, for categories such as “fracture,” “text,” “foreignbody,” “periostealreaction,” and “pronatorsign,” our algorithms demonstrated the highest mAP. Notably, the “bonelesion” category consistently maintains a high AP value across different models, particularly in nano and small models, providing remarkable detection results.

Comparison of the detection results of different real-time detection algorithms, in different categories, on the GRAZPEDWRI-DX dataset.

In conclusion, our algorithm consistently outperforms the other models in detection accuracy on both datasets, although this accuracy comes at the cost of more parameters and a higher computational load. Our experiments demonstrate the robust performance of WCAY relative to other object detection networks, as well as its strong generalizability and effectiveness for fracture detection in X-ray images.

Comparison of the mAP results of different real-time detection algorithms on the GRAZPEDWRI-DX dataset.

To further demonstrate the efficacy of the WCAY model, in addition to performing inference on the two X-ray fracture detection datasets, FracAtlas and GRAZPEDWRI-DX, we also perform inference on two public datasets whose images resemble X-ray images, NEU-DET52 and SSDD53. The WCAY model detects objects well in images from different domains and viewing angles, including objects with random orientations and different scales. The detection results are visualized in Fig. 14.

The figure shows some qualitative results of the WCAY algorithm proposed in this paper on four datasets.

As seen from the figure, on the FracAtlas dataset, our model clearly detects and localizes the fracture region in the X-ray image with a high confidence level; on the GRAZPEDWRI-DX dataset, it also detects features of skeletal disorders in addition to fracture features; and on the NEU-DET and SSDD datasets, it correctly detects the corresponding targets. The accurate localization and identification of targets in the displayed images demonstrate the effectiveness of the WCAY algorithm across various types of challenging image detection tasks.

In this paper, we propose a new algorithm, WCAY, for fracture detection at different sites in X-ray images. To improve the accuracy of the model in detecting fracture features, we use the DSConv module to improve the C2f module and propose a new core operator, DSC-C2f. We also introduce an attention mechanism to improve the model’s performance. In addition, we design a new channel attention mechanism (WCA), which is more effective at capturing long-range dependencies. The experimental results of the proposed WCAY model on the X-ray fracture detection datasets show that it has advantages over several mainstream real-time object detection methods. It performs well on evaluation metrics such as precision, recall, and mAP, reaching the state-of-the-art (SOTA) level. Specifically, for the small model size, the WCAY model improves the mAP on the FracAtlas dataset by 8.8% compared to the baseline model, while the mAP on the GRAZPEDWRI-DX dataset over all categories improves by 1.1%, and the mAP for the fracture category reaches 93.9%, demonstrating its strong capability for fracture detection in X-ray images.

The datasets analyzed during the current study are available at Figshare under https://figshare.com/articles/dataset/The_dataset/22363012 (FracAtlas) and https://figshare.com/articles/dataset/GRAZPEDWRI-DX/14825193 (GRAZPEDWRI-DX). Both datasets are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license (https://creativecommons.org/licenses/by/4.0/). The implementation code and trained models for this study are available on GitHub at https://github.com/cccp421/Fracture-Detection-WCAY; the provenance of the datasets used in the experiments is also documented there.

Forriol, F. & Mazzola, A. Bone fractures: Generalities. Textbook Musculoskeletal Disorders. https://doi.org/10.1007/978-3-031-20987-1_28 (2023).

Venneri, F. et al. Safe surgery saves lives. Textbook of Patient Safety and Clinical Risk Management. https://doi.org/10.1007/978-3-030-59403-9_14 (2021).

Lisacek-Kiosoglous, A. B. et al. Artificial intelligence in orthopedic surgery: exploring its applications, limitations, and future direction. Bone Joint Res. 12, 447–454. https://doi.org/10.1302/2046-3758.127.BJR-2023-0111.R1 (2023).

Xu, F. et al. Deep learning-based artificial intelligence model for classification of vertebral compression fractures: A multicenter diagnostic study. Front. Endocrinol. https://doi.org/10.3389/fendo.2023.1025749 (2023).

Ju, R. Y. & Cai, W. Fracture detection in pediatric wrist trauma X-ray images using YOLOv8 algorithm. Sci. Rep. https://doi.org/10.1038/s41598-023-47460-7 (2023).

Thian, Y. L. et al. Convolutional neural networks for automated fracture detection and localization on wrist radiographs. Radiology: Artificial Intelligence. https://doi.org/10.1148/ryai.2019180001 (2019).

Zhao, Z. Q., Zheng, P., Xu, S. T. & Wu, X. D. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems. 30, 3212–3232. https://doi.org/10.1109/TNNLS.2018.2876865 (2019).

Jiao, L. et al. A survey of deep learning-based object detection. IEEE Access. 7, 128837–128868. https://doi.org/10.1109/ACCESS.2019.2939201 (2019).

Arkin, E., Yadikar, N., Muhtar, Y., Ubul, K. A survey of object detection based on CNN and transformer. in IEEE International Conference on Pattern Recognition and Machine Learning (PRML) 99–108. https://doi.org/10.1109/PRML52754.2021.9520732 (2021).

Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Preprint at https://arxiv.org/abs/1311.2524 (2014).

Girshick, R. Fast r-cnn. in IEEE International Conference on Computer Vision (ICCV) 1440–1448. Preprint at https://arxiv.org/abs/1504.08083 (2015).

Ren, S., He, K., Girshick, R. & Sun, J. Faster r-cnn: Toward real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. Preprint at https://arxiv.org/abs/1506.01497 (2015).

Hou, L., Lu, K. & Xue, J. Refined one-stage oriented object detection method for remote sensing images. IEEE Transactions on Image Processing. 31, 1545–1558. https://doi.org/10.1609/aaai.v33i01.33018577 (2022).

Redmon et al. You Only Look Once: Unified, Real-Time Object Detection. Preprint at https://arxiv.org/abs/1506.02640 (2016).

Tsung, Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal Loss for Dense Object Detection. Preprint at https://arxiv.org/abs/1708.02002 (2018).

Niu, Z., Zhong, G. & Yu, H. A review on the attention mechanism of deep learning. Neurocomputing. 452, 48–62. https://doi.org/10.1016/j.neucom.2021.03.091 (2021).

Galassi, A., Lippi, M. & Torroni, P. Attention in natural language processing. IEEE Transactions on Neural Networks and Learning Systems. 32, 4291–4308. https://doi.org/10.1109/TNNLS.2020.3019893 (2021).

Wan, D. H. et al. Mixed local channel attention for object detection. Eng. Appl. Artif. Intell. 123. https://doi.org/10.1016/j.engappai.2023.106442 (2023).

Jie, H., Li, S. & Gang, S. Squeeze-and-excitation networks. in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 7132–7141, Preprint at https://arxiv.org/abs/1709.01507v4 (2019).

Woo, S. et al. Cbam: Convolutional block attention module. in European Conference on Computer Vision (ECCV) 3–19, Preprint at http://arxiv.org/abs/1807.06521 (2018).

Liu, Y., Shao, Z., Hoffmann, N. Global attention mechanism: Retain information to enhance channel-spatial interactions. Preprint at https://arxiv.org/abs/2112.05561v1 (2021).

Hou, Q., Zhou, D. & Feng, J. Coordinate attention for efficient mobile network design. in IEEE/CVF Conference on Computer Vision and Pattern Recognition 13713–13722, Preprint at https://arxiv.org/abs/2103.02907v1 (2021).

Abedeen et al. FracAtlas: A dataset for fracture classification, localization and segmentation of musculoskeletal radiographs. Sci. Data. 10, 521. https://doi.org/10.1038/s41597-023-02432-4 (2023).

Nagy, E. et al. A pediatric wrist trauma X-ray dataset (GRAZPEDWRI-DX) for machine learning. Sci. Data. 9, 222. https://doi.org/10.1038/s41597-022-01328-z (2022).

Qi, Y., He, Y., Qi, X., Zhang, Y., Yang, G. Dynamic snake convolution based on topological geometric constraints for tubular structure segmentation. in IEEE/CVF International Conference on Computer Vision (ICCV) 6047–6056. https://doi.org/10.1109/ICCV51070.2023.00558 (2023).

Guan, B., Zhang, G., Yao, J., Wang, X., Wang, M. Arm fracture detection in X-rays based on improved deep convolutional neural network. Comput. Electr. Eng. 81. https://doi.org/10.1016/j.compeleceng.2019.106530 (2020).

Rajpurkar, P. et al. Mura dataset: Toward radiologist-level abnormality detection in musculoskeletal radiographs. Preprint at https://arxiv.org/abs/1712.06957v4 (2017).

Yahalomi, E., Chernofsky, M. & Werman, M. Detection of distal radius fractures trained by a small set of X-ray images and faster R-CNN. Intell. Syst. Comput. 997. https://doi.org/10.1007/978-3-030-22871-2_69 (2019).

Wang, M. et al. ParallelNet: Multiple backbone network for detection tasks on thigh bone fracture. Multimedia Systems. 27, 1091–1100. https://doi.org/10.1007/s00530-021-00783-9 (2021).

Krogue, J. D. et al. Automatic hip fracture identification and functional subclassification with deep learning. Radiol. Artif. Intell. 2. https://doi.org/10.1148/ryai.2020190023 (2020).

Li, Y.-C. et al. Can a deep-learning model for the automated detection of vertebral fractures approach the performance level of human subspecialists? Clinical Orthopaedics and Related Research. 479, 1598–1612. https://doi.org/10.1097/CORR.0000000000001685 (2021).

Yuan, G., Liu, G., Wu, X., Jiang, R. An improved YOLOv5 for skull fracture detection. Exploration of novel intelligent optimization algorithms. Communications in Computer and Information Science 1590. https://doi.org/10.1007/978-981-19-4109-2_17 (2022).

Warin, K. et al. Maxillofacial fracture detection and classification in computed tomography images using convolutional neural network-based models. Sci. Rep. 13, 3434. https://doi.org/10.1038/s41598-023-30640-w (2023).

Fatima, J. et al. Vertebrae localization and spine segmentation on radiographic images for feature‐based curvature classification for scoliosis. Concurrency and Computation: Practice and Experience. 34. https://doi.org/10.1002/cpe.7300 (2022).

Dibo, R. et al. DeepLOC: Deep learning-based bone pathology localization and classification in wrist X-ray images. Analysis of Images, Social Networks and Texts. 14486. https://doi.org/10.1007/978-3-031-54534-4_14 (2024).

Ju, R. Y. & Cai, W. Fracture detection in pediatric wrist trauma X-ray images using YOLOv8 algorithm. Sci. Rep. 13, 20077. https://doi.org/10.1038/s41598-023-47460-7 (2023).

Tanzi, L., Vezzetti, E., Moreno, R. & Moos, S. X-ray bone fracture classification using deep learning: A baseline for designing a reliable approach. Applied Sciences. 10, 1507. https://doi.org/10.3390/app10041507 (2020).

Jocher, G. et al. Ultralytics YOLO. GitHub https://github.com/ultralytics/ultralytics (2023).

Jocher, G. et al. YOLOv5 by Ultralytics. GitHub. https://doi.org/10.5281/zenodo.3908559 (2020).

Ouyang, D. et al. Efficient multi-scale attention module with cross-spatial learning. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1–5. https://doi.org/10.1109/ICASSP49357.2023.10096516 (2023).

Buslaev, A. et al. Albumentations: Fast and flexible image augmentations. Information. 11, 125. https://doi.org/10.3390/info11020125 (2020).

Boyd, K., Eng, K. H., Page, C. D. Area under the precision-recall curve: Point estimates and confidence intervals. Machine learning and knowledge discovery in databases. Lecture Notes in Computer Science. 8190. https://doi.org/10.1007/978-3-642-40994-3_29 (2013).

Luo, W. et al. Understanding the effective receptive field in deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 29, https://proceedings.neurips.cc/paper/2016/hash/c8067ad1937f728f51288b3eb986afaa-Abstract.html (2016).

Shi, D. TransNeXt: Robust Foveal Visual Perception for Vision Transformers. Preprint at https://arxiv.org/abs/2311.17132 (2023).

Selvaraju et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. in Proceedings of the IEEE International Conference on Computer Vision 618–626, Preprint at https://arxiv.org/abs/1610.02391v4 (2017).

Carion, N., Massa, F., Synnaeve, G. et al. End-to-end object detection with transformers. Computer Vision—ECCV 2020 (ECCV 2020). vol 12346. https://doi.org/10.1007/978-3-030-58452-8_13 (2020).

Feng, C., Zhong, Y., Gao, Y. et al. Tood: Task-aligned one-stage object detection. in International Conference on Computer Vision (ICCV). IEEE Computer Society 3490–3499. https://doi.org/10.1109/ICCV48922.2021.00349 (2021).

Zhang, X., Wan, F., Liu, C. et al. Freeanchor: Learning to match anchors for visual object detection. Advances in Neural Information Processing Systems. 32. https://doi.org/10.48550/arXiv.1909.02466 (2019).

Zhao, Y., Lv, W., Xu, S. et al. Detrs beat yolos on real-time object detection. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2024) 16965–16974. https://doi.org/10.48550/arXiv.2304.08069 (2024).

Liu, S., Li, F., Zhang, H. et al. Dab-detr: Dynamic anchor boxes are better queries for detr. Preprint at https://doi.org/10.48550/arXiv.2201.12329 (2022).

Meng, D., Chen, X., Fan, Z. et al. Conditional detr for fast training convergence. in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV2021) 3651–3660. https://doi.org/10.48550/arXiv.2108.06152 (2021).

Zhao, W. D. et al. A new steel defect detection algorithm based on deep learning. Computational Intelligence and Neuroscience 1–13. https://doi.org/10.1155/2021/5592878 (2021).

Wang, Y. Y. et al. A SAR dataset of ship detection for deep learning under complex backgrounds. Remote Sensing. 11, 765. https://doi.org/10.3390/rs11070765 (2019).

Li, C. Y. et al. YOLOv6 by Meituan. GitHub https://github.com/meituan/YOLOv6 (2022).

These authors contributed equally: Wenbin Lu and Fangpeng Lu.

Heilongjiang University, Harbin, 150080, China

Peng Chen, Songyan Liu, Wenbin Lu, Fangpeng Lu & Boyang Ding

P.C. is mainly responsible for writing the manuscript and conducting experiments throughout the entire research. S.L. is responsible for the overall direction and supervision of the paper. W.L. and F.L. are responsible for the overall layout of the paper and embellishment. B.D. is responsible for project management and coordination to ensure that the project schedule meets expectations. All authors reviewed the manuscript.

Correspondence to Songyan Liu.

The authors declare no competing interests.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Chen, P., Liu, S., Lu, W. et al. WCAY object detection of fractures for X-ray images of multiple sites. Sci Rep 14, 26702 (2024). https://doi.org/10.1038/s41598-024-77878-6

Received: 17 April 2024

Accepted: 25 October 2024

Published: 04 November 2024

DOI: https://doi.org/10.1038/s41598-024-77878-6
