You Only Look Once (YOLO): Unified, Real-Time Object Detection
Object detection is a fundamental task in computer vision that involves identifying and locating objects of interest within an image or video. Unlike image classification, which assigns a single label to an entire image, object detection provides both the category and the precise location of objects by outputting bounding boxes around them. This dual capability makes it a crucial technology for a wide range of applications, including autonomous driving, medical imaging, and augmented reality.
Authors: Alisia Picciano, Giorgio Micaletto, Martin Patrikov
YOLO: A game-changer in object detection
YOLO is a significant improvement over older R-CNN architectures due to its unified approach to object detection. Unlike R-CNN, which first generates region proposals and then classifies them, YOLO processes the entire image in a single forward pass of the network. This allows it to achieve real-time detection with high accuracy, making it faster and more efficient for applications that require instantaneous object localization and classification. The efficiency comes from its ability to divide the image into a grid and simultaneously predict bounding boxes, confidence scores, and class probabilities for every grid cell. This approach not only streamlines the detection process but also reduces computational overhead compared to region-based methods. By treating detection as a regression problem, YOLO avoids the need for complicated multi-stage pipelines, making it well suited for applications such as autonomous driving, surveillance, and robotics.
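To make the regression framing concrete, here is a minimal, purely illustrative sketch of the output size implied by the settings reported in the original paper: a 7×7 grid (\(S = 7\)), two boxes per cell (\(B = 2\)), and 20 PASCAL VOC classes (\(C = 20\)).

```python
# Size of YOLO's output tensor under the settings from the original paper:
# a 7x7 grid (S = 7), 2 boxes per cell (B = 2), and 20 PASCAL VOC classes (C = 20).
S, B, C = 7, 2, 20

# Each box contributes 5 numbers (x, y, w, h, confidence); each cell
# additionally predicts one set of C class probabilities.
per_cell = B * 5 + C          # 30 values per grid cell
output_size = S * S * per_cell

print(per_cell)     # 30
print(output_size)  # 1470 values in total, i.e. a 7x7x30 tensor
```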
How YOLO works
Detection
YOLO approaches object detection with a methodology that reasons globally across the entire image, encoding contextual information during both training and inference. This global view sharply reduces false positives: YOLO makes less than half as many background errors as its region-based predecessors. The detection process begins by dividing the input image into an \(S \times S\) grid. Each grid cell is responsible for detecting objects whose centers fall within its boundaries. This is where YOLO’s hallmark efficiency comes into play: each cell predicts multiple bounding boxes (typically \(B = 2\)), each accompanied by a confidence score that reflects both the probability that an object is present and the accuracy of the box’s localization. Each bounding box prediction consists of five parameters: \(x\), \(y\), \(w\), \(h\), and the confidence score. Here, \(x\) and \(y\) are the box’s center coordinates relative to the grid cell, while \(w\) and \(h\) are the box’s width and height relative to the entire image. The final predictions are obtained by multiplying the class probabilities with the confidence scores, yielding class-specific confidence scores for each bounding box.
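As an illustrative sketch (not the authors’ code), the snippet below shows how a raw prediction tensor of shape \(S \times S \times (5B + C)\) can be decoded into class-specific confidence scores by multiplying each box’s confidence with its cell’s class probabilities. The per-cell layout assumed here (box parameters first, then class probabilities) is one common convention, chosen only for illustration.

```python
import numpy as np

# Minimal sketch of decoding a raw YOLO output tensor into class-specific
# confidence scores. Assumed per-cell layout: [x, y, w, h, conf] * B,
# followed by C class probabilities; shapes follow S = 7, B = 2, C = 20.
S, B, C = 7, 2, 20

def class_specific_scores(pred):
    """pred: array of shape (S, S, B * 5 + C)."""
    boxes = pred[..., :B * 5].reshape(S, S, B, 5)   # (S, S, B, 5)
    box_conf = boxes[..., 4]                        # Pr(Object) * IoU
    class_probs = pred[..., B * 5:]                 # Pr(Class_i | Object)
    # Pr(Class_i | Object) * Pr(Object) * IoU = Pr(Class_i) * IoU
    scores = box_conf[..., None] * class_probs[:, :, None, :]  # (S, S, B, C)
    return boxes[..., :4], scores

# Example with a dummy prediction tensor:
pred = np.random.rand(S, S, B * 5 + C)
coords, scores = class_specific_scores(pred)
print(coords.shape, scores.shape)  # (7, 7, 2, 4) (7, 7, 2, 20)
```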
Network Design
The original YOLO architecture, which makes the detection described above possible, was introduced by Joseph Redmon and colleagues in 2016 and implemented in their Darknet framework. Its network design is inspired by the GoogLeNet architecture but tailored for end-to-end object detection, incorporating both feature extraction and bounding box prediction in a unified pipeline. At its core, the YOLO network consists of 24 convolutional layers followed by 2 fully connected layers. The convolutional layers apply filters to extract spatial and semantic features from the input image, while the fully connected layers combine these features globally to produce the final output: bounding box coordinates, class probabilities, and confidence scores.
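The PyTorch sketch below mirrors this overall structure, a convolutional feature extractor followed by two fully connected layers that regress a 7×7×30 prediction tensor, but collapses the 24 convolutional layers into a few representative blocks. It illustrates the pipeline rather than reproducing the exact published configuration.

```python
import torch
import torch.nn as nn

# Condensed sketch of YOLO's overall structure: convolutional feature
# extraction followed by two fully connected layers that regress the
# 7x7x30 detection tensor. The 24 convolutional layers of the original
# network are collapsed into a few representative blocks, so this is an
# illustration of the pipeline, not the published architecture.
class TinyYOLO(nn.Module):
    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.S, self.B, self.C = S, B, C
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 192, 3, padding=1), nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
            nn.Conv2d(192, 512, 3, padding=1), nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
            nn.Conv2d(512, 1024, 3, padding=1), nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d((S, S)),       # force an S x S feature map
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024 * S * S, 4096), nn.LeakyReLU(0.1),
            nn.Linear(4096, S * S * (B * 5 + C)),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.head(x)
        # Reshape the flat prediction into the S x S x (B*5 + C) grid.
        return x.view(-1, self.S, self.S, self.B * 5 + self.C)

# Example: a 448x448 input (as in the paper) yields a 7x7x30 prediction.
model = TinyYOLO()
out = model(torch.randn(1, 3, 448, 448))
print(out.shape)  # torch.Size([1, 7, 7, 30])
```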
Loss function
A critical element of YOLO’s design is its loss function, which balances localization, confidence, and classification accuracy. The loss is based on the sum of squared errors (SSE), chosen for its simplicity and ease of optimization. However, this choice introduces challenges: it weights localization and classification errors equally, and the many grid cells that contain no object can overwhelm the gradient from those that do, which can destabilize training. To address this, YOLO incorporates a balancing mechanism in its loss function. It assigns a higher weight ( \(\lambda_{\text{coord}} = 5\) ) to the loss from bounding box coordinate predictions, ensuring precise localization, and a lower weight ( \(\lambda_{\text{noobj}} = 0.5\) ) to the confidence loss for grid cells without objects, preventing these cells from overpowering the training signal. Additionally, the loss penalizes classification error only in grid cells that contain an object, and coordinate error only for the bounding box predictor “responsible” for that object, namely the one with the highest Intersection over Union (IoU) with the ground truth. This design encourages each bounding box predictor to specialize in particular sizes and aspect ratios, improving recall and overall robustness.
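For reference, the full loss from the original paper, with the weights described above. Here \(\mathbb{1}_{ij}^{\text{obj}}\) indicates that the \(j\)-th box predictor in cell \(i\) is responsible for the object, \(\mathbb{1}_{ij}^{\text{noobj}}\) the opposite, \(\mathbb{1}_{i}^{\text{obj}}\) indicates that an object appears in cell \(i\), and the square roots on \(w\) and \(h\) soften the penalty on large boxes:
\[
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[\left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2
+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
\]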
Limitations
Despite its revolutionary impact, YOLO is not without limitations. One major drawback is its difficulty handling multiple objects within a single grid cell. Since each cell predicts only one set of class probabilities, it can struggle with small or overlapping objects, such as a car and a pedestrian in close proximity.
Another limitation arises from the training process itself: because each cell outputs a single class distribution, the model is encouraged to predict the class of the most dominant object, which can suppress smaller or less prominent objects within the same grid cell.
Why was YOLO revolutionary?
YOLO revolutionised object detection by offering an unprecedented combination of speed, simplicity, and global reasoning. Unlike earlier systems, YOLO abandoned the disjointed, multi-stage pipelines like those used in Deformable Parts Models (DPM) and Region-based CNNs (R-CNN), and instead introduced a unified framework. This allowed for real-time processing speeds, with the base version achieving 45 fps and the fast version exceeding 150 fps, enabling tasks like real-time streaming video with latency under 25 ms.
Unlike older methods, such as R-CNN or Selective Search-based models, which operated on parts of the image or relied on region proposals, YOLO analysed the entire image during both training and testing. This global perspective reduced background errors and improved context understanding, making it less prone to failure in new domains. This was a game-changer compared to models like Faster R-CNN, which, although faster than R-CNN, still couldn’t match YOLO’s speed or real-time applicability. YOLO’s generalisability was demonstrated by applying it to artwork datasets, where its average precision (AP) degraded the least among the compared models.