You Only Look Once (YOLO): Unified, Real-time Object Detection

Object detection is a fundamental task in computer vision that involves identifying and locating objects of interest within an image or video. Unlike image classification, which assigns a single label to an entire image, object detection provides both the category and the precise location of objects by outputting bounding boxes around them. This dual capability makes it a crucial technology for a wide range of applications, including autonomous driving, medical imaging, and augmented reality.

Authors: Alisia Picciano, Giorgio Micaletto, Martin Patrikov

YOLO: A game-changer in object detection

YOLO is a significant improvement over older R-CNN architectures because of its unified approach to object detection. Whereas R-CNN first generates region proposals and then classifies each of them, YOLO processes the entire image in a single forward pass of the network, which lets it achieve real-time detection with high accuracy and makes it faster and more efficient for applications requiring instantaneous object localization and classification. This efficiency comes from dividing the image into a grid and simultaneously predicting bounding boxes, confidence scores, and class probabilities for every grid cell. The approach not only streamlines the detection process but also reduces computational overhead compared to region-based methods. By treating detection as a single regression problem, YOLO avoids a complicated multi-stage pipeline, making it well suited to applications such as autonomous driving, surveillance, and robotics.

How YOLO works

Detection

YOLO approaches object detection with a methodology that reasons globally across the entire image, encoding contextual information during both training and inference. This significantly reduces false positives and background errors; the original paper reports that YOLO makes fewer than half as many background errors as Fast R-CNN. The detection process begins by dividing the input image into an S×S grid. Each grid cell is responsible for detecting objects whose centers fall within its boundaries, and this is where YOLO’s hallmark efficiency comes into play. Each cell predicts multiple bounding boxes (typically B = 2), each accompanied by a confidence score reflecting both the probability that an object is present and the accuracy of the box’s localization. Every bounding box prediction consists of five parameters: x, y, w, h, and the confidence score. Here, x and y are the box’s center coordinates relative to the grid cell, while w and h are the box’s width and height relative to the entire image. The final predictions are obtained by multiplying the conditional class probabilities with the box confidence scores, yielding class-specific confidence scores for each bounding box.
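In the notation of the original paper, each box’s confidence is defined as \( \Pr(\text{Object}) \cdot \text{IoU}^{\text{truth}}_{\text{pred}} \), so multiplying by the conditional class probabilities gives the class-specific score used at test time:

\[
\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IoU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \cdot \text{IoU}^{\text{truth}}_{\text{pred}}
\]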

Network Design

The original YOLO architecture, which makes the detection described above possible, was introduced by Joseph Redmon and colleagues in 2016 and implemented in their Darknet framework. Its network design is inspired by the GoogLeNet architecture but tailored for end-to-end object detection, incorporating both feature extraction and bounding box prediction in a unified pipeline. At its core, the YOLO network consists of 24 convolutional layers followed by 2 fully connected layers. The convolutional layers apply filters to extract spatial and semantic features from the input image, while the fully connected layers combine features globally to produce the final output: bounding box coordinates, class probabilities, and confidence scores.
The input layer has size 448×448×3, where the first two numbers are the width and height of the image and the last is the number of channels, in this case the three RGB colours. The convolutional layers are interleaved with max-pooling layers, which handle downsampling while preserving the most important information. As data passes from layer to layer, the number of channels grows because each layer extracts richer features, and the receptive field expands so that the model gradually sees more and more of the whole image. Each cell predicts a fixed number of bounding boxes (typically B = 2), each with a confidence score indicating the likelihood that the box contains an object and the accuracy of its localisation. The final output is a 7×7×30 tensor of predictions for B = 2 and C = 20 (where C is the number of classes). In other words, the image is divided into 7×7 = 49 regions, and each region carries 5 values per bounding box (x, y, w, h, and the confidence that an object is present) for both of its B = 2 boxes, plus one probability for each of the 20 classes: 2×5 + 20 = 30 numbers per cell.
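To make this layout concrete, here is a minimal sketch (with a hypothetical decode_predictions helper, not code from the original implementation) that unpacks a 7×7×30 prediction tensor into boxes and class-specific scores:

    import numpy as np

    S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

    def decode_predictions(output, score_threshold=0.2):
        # output has shape (S, S, B*5 + C): per cell, B boxes of (x, y, w, h, conf)
        # followed by C conditional class probabilities.
        detections = []
        for row in range(S):
            for col in range(S):
                cell = output[row, col]
                class_probs = cell[B * 5:]           # Pr(Class_i | Object)
                for b in range(B):
                    x, y, w, h, conf = cell[b * 5: b * 5 + 5]
                    scores = conf * class_probs      # class-specific confidence scores
                    best = int(np.argmax(scores))
                    if scores[best] > score_threshold:
                        # x, y are offsets within the cell; w, h are relative to the image
                        cx, cy = (col + x) / S, (row + y) / S
                        detections.append((cx, cy, w, h, best, float(scores[best])))
        return detections

    # Example with a random tensor of the expected shape (7, 7, 30)
    dummy = np.random.rand(S, S, B * 5 + C)
    print(len(decode_predictions(dummy)))

In practice the decoded boxes are also filtered with non-maximum suppression to remove duplicate detections of the same object.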

Loss function

A critical element of YOLO’s design is its loss function, which balances localization, confidence, and classification accuracy. The loss is based on the sum of squared errors (SSE), chosen for its simplicity and ease of optimization. However, this choice introduces challenges: it weights localization error equally with classification error, and the many grid cells that contain no object can overpower the gradient from the cells that do, destabilizing training. To address this, YOLO incorporates a balancing mechanism in its loss function. It assigns a higher weight ( \(\lambda_{\text{coord}} = 5\) ) to the loss from bounding box coordinate predictions, encouraging precise localization, and a lower weight ( \(\lambda_{\text{noobj}} = 0.5\) ) to the confidence loss of grid cells without objects, preventing them from drowning out the training signal from cells that contain objects. Additionally, YOLO penalizes classification error only in grid cells that contain an object, and coordinate error only for the box predictor “responsible” for the object, i.e. the one with the highest Intersection over Union (IoU) with the ground truth. This design encourages each bounding box predictor to specialize in particular shapes and sizes, improving recall and overall robustness.
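Written out in full, the loss from the original paper is:

\[
\begin{aligned}
\mathcal{L} = {}& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2
+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
\]

where \( \mathbb{1}_{ij}^{\text{obj}} \) indicates that the j-th box predictor in cell i is responsible for an object. The square roots of the width and height make errors in small boxes count more than equally sized errors in large boxes.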

Limitations

Despite its revolutionary impact, YOLO is not without limitations. One major drawback is its inability to handle multiple objects within a single grid cell. Since each cell predicts only one set of class probabilities, it can struggle with small or overlapping objects, such as a car and a pedestrian in close proximity.

Another limitation arises from the training process itself. Because each cell is pushed to predict the class of the most dominant object it contains, the model is biased toward larger, more prominent objects, and smaller or less dominant objects within the same grid cell can end up suppressed.

Why was YOLO revolutionary?

YOLO revolutionised object detection by offering an unprecedented combination of speed, simplicity, and global reasoning. Unlike earlier systems, YOLO abandoned the disjointed, multi-stage pipelines like those used in Deformable Parts Models (DPM) and Region-based CNNs (R-CNN), and instead introduced a unified framework. This allowed for real-time processing speeds, with the base version achieving 45 fps and the fast version exceeding 150 fps, enabling tasks like real-time streaming video with latency under 25 ms.

Unlike older methods, such as R-CNN or Selective Search-based models, which operated on image patches or relied on region proposals, YOLO analysed the entire image during both training and testing. This global perspective reduced background errors and improved context understanding, making it less prone to failure in new domains. This was a game-changer compared to models like Faster R-CNN, which, although faster than R-CNN, still couldn’t match YOLO’s speed or real-time applicability. YOLO’s generalisability was demonstrated by applying it to artwork datasets, where its average precision (AP) degrades the least compared to other models.

While YOLO’s accuracy lagged slightly behind models like Fast R-CNN, it struck a strong balance between performance and efficiency. YOLO’s strengths were further demonstrated in ensemble setups, where combining it with systems like Fast R-CNN let its low background-error rate filter out false positives, boosting overall mean average precision (mAP) with negligible computational overhead.

The future of object detection (with YOLO)

Since its introduction, YOLO has revolutionized object detection by shifting the paradigm from multi-stage pipelines, like those in R-CNN, to single-stage models that achieve real-time performance without sacrificing accuracy. This innovation has inspired a wave of research, resulting in iterative improvements of the YOLO architecture, with the latest iteration being YOLOv8. The future of YOLO and object detection lies in addressing current challenges, like handling small, overlapping objects and achieving even greater efficiency. Innovations like self-supervised learning and transformer-based models may also influence the evolution of YOLO, further bridging the gap between performance and speed in the coming years.

Implementation

Below is example code using the fifth iteration of YOLO (YOLOv5) to process frames captured from a webcam. The model detects objects in each frame, draws bounding boxes with labels and confidence scores, and displays the results in a live video feed.
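A minimal sketch of such a loop, assuming PyTorch, opencv-python, and the pretrained yolov5s model loaded from the Ultralytics PyTorch Hub; exact behaviour may vary between YOLOv5 releases:

    # Minimal sketch: YOLOv5 (yolov5s) on webcam frames via PyTorch Hub and OpenCV.
    # Assumes torch and opencv-python are installed and a webcam is available.
    import cv2
    import torch

    model = torch.hub.load("ultralytics/yolov5", "yolov5s")  # small pretrained model

    cap = cv2.VideoCapture(0)  # default webcam
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break

        # YOLOv5's AutoShape wrapper accepts RGB numpy arrays directly
        results = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        # Each detection row is (x1, y1, x2, y2, confidence, class_id)
        for *box, conf, cls in results.xyxy[0].tolist():
            x1, y1, x2, y2 = map(int, box)
            label = f"{model.names[int(cls)]} {conf:.2f}"
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, label, (x1, y1 - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

        cv2.imshow("YOLOv5 webcam", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to quit
            break

    cap.release()
    cv2.destroyAllWindows()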
