arXiv:2504.13099

RF-DETR Object Detection vs YOLOv12: A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity

Published on Apr 17 · Submitted by RanjanSapkota on Apr 22

Abstract

This study conducts a detailed comparison of the RF-DETR object detection base model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruits) annotations to assess model performance under dynamic real-world conditions. The RF-DETR object detection model, utilizing a DINOv2 backbone and deformable attention, excelled in global context modeling, effectively identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment. RF-DETR achieved the highest mean Average Precision (mAP@50) of 0.9464 in single-class detection, proving its superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed it in complex spatial scenarios. For multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Training-dynamics analysis highlighted RF-DETR's swift convergence, particularly in single-class settings where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data. These findings validate RF-DETR's effectiveness for precision agricultural applications, with YOLOv12 suited for fast-response scenarios.

Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs
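For context, the mAP figures quoted above follow the standard COCO-style evaluation convention; the formulation below is that common definition, not a derivation taken from the paper itself:

$$
\mathrm{IoU}(B_p, B_g) = \frac{|B_p \cap B_g|}{|B_p \cup B_g|}, \qquad \mathrm{AP}_\tau = \int_0^1 p_\tau(r)\, dr,
$$

$$
\text{mAP@50} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{AP}_{0.5}^{(c)}, \qquad \text{mAP@50:95} = \frac{1}{C}\sum_{c=1}^{C} \frac{1}{10} \sum_{\tau \in \{0.50,\, 0.55,\, \ldots,\, 0.95\}} \mathrm{AP}_{\tau}^{(c)},
$$

where a predicted box $B_p$ counts as a true positive when its IoU with a matched ground-truth box $B_g$ meets the threshold $\tau$, $p_\tau(r)$ is precision as a function of recall at that threshold, and $C$ is the number of classes ($C = 1$ for the single-class greenfruit setting, $C = 2$ for occluded vs. non-occluded). The stricter mAP@50:95 averages over ten IoU thresholds, which is why it rewards the tighter localization and finer occlusion discrimination noted for YOLOv12N and YOLOv12L.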

Community

Paper author and submitter:

This study presents a comprehensive comparison between the RF-DETR and YOLOv12 object detection models for greenfruit recognition in complex orchard environments characterized by label ambiguity, occlusion, and background camouflage. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruits) annotations to assess model performance under real-world conditions. The RF-DETR model, leveraging a DINOv2 backbone with deformable attention mechanisms, excelled in global context modeling, which proved particularly effective for identifying partially occluded or visually ambiguous greenfruits. Conversely, YOLOv12 employed CNN-based attention mechanisms to enhance local feature extraction, optimizing it for computational efficiency and suitability for edge deployment. In single-class detection, RF-DETR achieved the highest mean Average Precision (mAP@50) of 0.9464, showcasing its robust capability to localize greenfruits within cluttered scenes. Although YOLOv12N achieved the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed it in complex spatial scenarios. In multi-class detection, RF-DETR again led with an mAP@50 of 0.8298, demonstrating its effectiveness in distinguishing occluded from non-occluded fruits, whereas YOLOv12L topped the mAP@50:95 metric at 0.6622, indicating superior classification under detailed occlusion conditions. Analysis of training dynamics revealed RF-DETR's rapid convergence, particularly in single-class scenarios where it plateaued in fewer than 10 epochs, underscoring the efficiency and adaptability of transformer-based architectures to dynamic visual data. These results confirm RF-DETR's suitability for accuracy-critical agricultural tasks, while YOLOv12 remains the better fit for speed-sensitive deployments.
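For readers who want to run a comparison like this themselves, here is a minimal inference sketch. It assumes Roboflow's `rfdetr` package and the `ultralytics` package with its `yolo12n.pt` checkpoint; the image path and confidence threshold are illustrative placeholders, and the pretrained weights shown are generic COCO models, not the greenfruit-finetuned models evaluated in the paper.

```python
# Hedged sketch: side-by-side inference with RF-DETR and YOLOv12.
# Assumes `pip install rfdetr ultralytics`; the weight names and image
# path below are illustrative placeholders, not artifacts of this paper.
from rfdetr import RFDETRBase   # Roboflow's RF-DETR base model
from ultralytics import YOLO    # Ultralytics YOLO12
from PIL import Image

image_path = "orchard_greenfruit.jpg"  # placeholder test image

# RF-DETR: transformer-based detector (DINOv2 backbone, deformable attention).
rf_detr = RFDETRBase()  # downloads pretrained COCO weights on first use
rf_dets = rf_detr.predict(Image.open(image_path), threshold=0.5)
print(f"RF-DETR detections: {len(rf_dets)}")

# YOLOv12: CNN-based detector with an attention-centric design.
yolo = YOLO("yolo12n.pt")  # nano variant, as benchmarked above
results = yolo(image_path, conf=0.5)
print(f"YOLOv12n detections: {len(results[0].boxes)}")
```

To reproduce the paper's numbers rather than COCO-pretrained behavior, both models would first need to be fine-tuned on a comparable single-class or occluded/non-occluded greenfruit dataset.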

