Image Quality Assessment: From Error Visibility to Structural Similarity

Assessing the quality of an image

Since the early days of the internet, scoring the quality of an image (or a video) has been central to the functioning of websites, applications, and streaming services. Although for humans this is a simple and intuitive task, teaching a computer to assess the quality of images in a way that resembles human judgment is a great challenge.

There are many approaches to performing such a task, but in the early 2000s the most promising results came from “full reference” algorithms: when comparing two images, one of them is assumed to be at 100% quality (the reference image) and the other to be a degraded version. The quality of the degraded image is then computed relative to the reference.

Image X: quality = 100%
Image Y: quality = ?%

The traditional approach: error visibility

From the perspective of a computer scientist, the most intuitive approach to teaching a computer how to assess the quality of images is probably to compare them pixel by pixel and compute the average of the differences (the “errors”) in pixel values. This procedure, called Mean Squared Error (MSE), is a good starting point, but it rests on a flawed assumption: it gives the same importance to every pixel, whereas some pixels may matter more than others (such as the central ones or the most luminous).
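
As a minimal sketch (assuming two equally sized grayscale images already loaded as NumPy arrays), the MSE computation could look like this:

import numpy as np

def mse(reference: np.ndarray, distorted: np.ndarray) -> float:
    # Average of the squared pixel-by-pixel differences between the two images
    ref = reference.astype(np.float64)
    dist = distorted.astype(np.float64)
    return float(np.mean((ref - dist) ** 2))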

So, a desired improvement would be to find a way to assign a weight to each pixel so that the weighting resembles the behavior of the human visual system (HVS).

Many researchers chose visibility as the criterion to assign weights to the pixels, giving birth to a bottom-up approach based on error visibility.

The new paradigm: structural similarity

Until the early 2000s, this was the main trend of full reference algorithms, but it relied on some wrong assumptions and generalizations.

First of all, it is not generally true that the visibility of an error is related to its impact on quality. For example, scaling all pixels by a constant value (even the most visible ones) produces an extremely high error, yet the image still conveys the original information almost entirely.

[Figure: the original image alongside two distorted versions, with MSE values of 4,446.56 and 21,590.31 relative to the original]
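
To make the point concrete, here is a tiny sketch reusing the mse helper above (with a random stand-in array instead of a real photo): scaling every pixel by a constant produces an MSE in the thousands, even though the picture would still read essentially the same to a human observer.

import numpy as np

img = np.random.randint(0, 170, size=(256, 256), dtype=np.uint8)          # stand-in image
scaled = np.clip(img.astype(np.float64) * 1.5, 0, 255).astype(np.uint8)   # global scaling
print(mse(img, scaled))   # on the order of a couple of thousand for this global change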

Moreover, algorithms based on error visibility mimic a very low-level activity of our visual system: recognizing colors and giving more importance to some elements than to others.

However, they ignore the real, high-level goal of our visual system, which is obtaining information from the observed objects. 

For the sake of illustration, consider the image below: would you say it’s a high-quality image?

If you’ve never seen this image before, chances are it will seem like a set of random black dots assembled in a very low-quality way. But what if this image represented a Dalmatian dog sniffing the ground? The perception of the quality changes completely because one is now able to understand the structural information behind those (apparently) random black dots.

Thus, if we assume that the human visual system is highly adapted to detect the structural information present in an image, one could conclude that such structural information is a key factor in determining its quality.

So, instead of comparing two images pixel by pixel and somehow averaging the errors, a better approach could be to evaluate how similar the information conveyed by the two images is, and to use that value to assess the quality loss. But how can this be expressed mathematically?

In 2004, Zhou Wang, Alan Bovik, Hamid Sheikh, and Eero Simoncelli published a paper in the IEEE Transactions on Image Processing describing a full-reference algorithm based on “structural similarity”, i.e. similarity in the overall structure of two images.

Full-reference algorithms were nothing new at the time; what was different about this paper was the approach to the problem: a new top-down philosophy built on an assumption that generalized the overall functioning of the human visual system.

The difference from the previous approach lies in the fact that image degradation is quantified not by the errors between the two images, but by the perceived changes in structural information.

Calculating structural similarity

The overall similarity of the two images is obtained by comparing their similarity in luminance, contrast, and structure, normalized so that each of the three components is mathematically independent of the others.

The algorithm works locally on small portions of the image (e.g. an 8×8 pixel window), and the final result is the average of all the local scores.

First of all, the algorithm ensures the two images have the same dimensions and converts them to grayscale, so that the intensity of each pixel is represented by a single numerical value ranging from 0 to 255 (from black to white).
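
A minimal sketch of this preprocessing step (assuming Pillow and NumPy, with hypothetical file names):

import numpy as np
from PIL import Image

ref = Image.open("reference.png").convert("L")      # "L" = 8-bit grayscale, values 0-255
dist = Image.open("distorted.png").convert("L")
assert ref.size == dist.size, "the two images must have the same dimensions"
ref_arr, dist_arr = np.asarray(ref), np.asarray(dist)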

As said before, the fundamental assumption around which SSIM orbits is that the human visual system has evolved to detect structural patterns, which ultimately are what gives an image its meaning (remember the example of the Dalmatian dog).

The other fundamental concept is that the structure of an object is independent of luminance and contrast (until the saturation point). Picture a reflective surface such as a glass vase: the structure of the object is physically independent from the illumination in the scene.

The local luminance is estimated by computing the mean intensity 𝞵 of the 8×8 pixel window; the standard deviation 𝝈 of each of the two signals is then calculated to estimate the local contrast within the same window.
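
Concretely, for a window x = {x_i | i = 1, …, N} of N pixels, these are the standard estimates used in the original paper:

𝞵_x = (1/N) · Σ x_i

𝝈_x = ( (1/(N − 1)) · Σ (x_i − 𝞵_x)² )^(1/2)

with the sums running over all N pixels of the window, and the same formulas applied to the corresponding window y of the distorted image.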

After luminance and contrast have been defined, the structural information is isolated by subtracting the mean from the original signal and dividing it by its standard deviation, i.e. (x − 𝞵_x) / 𝝈_x.

Now that the luminance, contrast, and structure information of the distorted and reference signals has been isolated and defined, the only remaining step is to compare them with each other and finally assign a similarity score.

Before going into the math behind the model, let’s imagine the following situation: in a very dark room where nobody sees anything, someone lights up a matchstick. It’s reasonable to say that everyone in that room is going to notice the light change.

But now let’s switch the settings to Times Square at 1 p.m. and imagine what result the same action would produce: it’s safe to say that nobody would ever notice the change in light.

Generalizing this phenomenon, it is possible to conclude that human perception is far more sensitive to the relative change in a stimulus than to the absolute change. This statement describes a key concept in the functioning of the SSIM model: Weber's law.

The relative change in luminance, contrast, and structure is more relevant than the absolute change.

It follows that the mathematical formulae used to assess the amount of change between the two images are essentially an implementation of this key concept.

Luminance comparison function:

l(x, y) = (2·𝞵_x·𝞵_y + C₁) / (𝞵_x² + 𝞵_y² + C₁)

Contrast comparison function:

c(x, y) = (2·𝝈_x·𝝈_y + C₂) / (𝝈_x² + 𝝈_y² + C₂)

(where 𝞵 is the mean intensity, 𝝈 is the standard deviation, x is a portion of the reference image, y is the corresponding portion of the distorted one, and C₁, C₂ are small constants that keep the result stable when the denominator approaches zero). Visually, it's possible to see that both functions share the same structure:

2AB / (A² + B²)

Last, by applying the same structure-isolation process as before (subtracting the mean from each signal and normalizing it by its standard deviation), the following structure comparison function is derived:

s(x, y) = (𝝈_xy + C₃) / (𝝈_x·𝝈_y + C₃)

(where 𝝈_xy is the covariance between x and y, and C₃ is another small stabilizing constant).

It's now possible to construct and understand the final form of the Structural SIMilarity function, which is just a combination of the three comparison functions seen above:

SSIM(x, y) = [l(x, y)]^α · [c(x, y)]^β · [s(x, y)]^γ

where α > 0, β > 0, and γ > 0 are weights representing the relative importance of each component. Since the SSIM index works locally on small portions of the image (e.g. 8×8 pixels), the total similarity measure is obtained by averaging all the local SSIM values. This final form takes the name of Mean Structural SIMilarity: MSSIM.
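
Putting everything together, here is a simplified sketch of the whole computation in Python. It assumes two equally sized grayscale images as NumPy arrays, uses non-overlapping 8×8 windows instead of the sliding window of the original implementation, sets α = β = γ = 1, and adopts the constants K₁ = 0.01 and K₂ = 0.03 suggested in the paper for 8-bit images.

import numpy as np

L = 255                      # dynamic range of 8-bit grayscale pixels
C1 = (0.01 * L) ** 2         # stabilizing constants with K1 = 0.01, K2 = 0.03
C2 = (0.03 * L) ** 2
C3 = C2 / 2

def local_ssim(x: np.ndarray, y: np.ndarray) -> float:
    # SSIM of a single pair of windows with alpha = beta = gamma = 1
    mu_x, mu_y = x.mean(), y.mean()
    sigma_x, sigma_y = x.std(ddof=1), y.std(ddof=1)
    sigma_xy = ((x - mu_x) * (y - mu_y)).sum() / (x.size - 1)
    luminance = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)
    contrast = (2 * sigma_x * sigma_y + C2) / (sigma_x ** 2 + sigma_y ** 2 + C2)
    structure = (sigma_xy + C3) / (sigma_x * sigma_y + C3)
    return float(luminance * contrast * structure)

def mssim(reference: np.ndarray, distorted: np.ndarray, win: int = 8) -> float:
    # Average the local SSIM scores over non-overlapping win x win windows
    ref = reference.astype(np.float64)
    dist = distorted.astype(np.float64)
    scores = []
    for i in range(0, ref.shape[0] - win + 1, win):
        for j in range(0, ref.shape[1] - win + 1, win):
            scores.append(local_ssim(ref[i:i + win, j:j + win],
                                     dist[i:i + win, j:j + win]))
    return float(np.mean(scores))

Identical images give an MSSIM of 1, while increasingly distorted versions push the score toward 0.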

Analysis of AI upscaling software

We asked ourselves what could be a fun and interesting way to see the MSSIM index in action, and we came up with an idea: an analysis of the best AI upscaling tools among a pre-chosen sample of ten popular ones. We chose as the reference image a photo of a beautiful Sonoma landscape, downscaled it to 50% of its original size (from 1500×938 to 750×469), and then used each tool to upscale it back to the original size. Finally, we compared the upscaled versions with the reference image. This is what we found:

As shown by the graph above, some tools performed better than others, and “bigjpg.com” turned out to be the best AI upscaler among the sample analyzed.
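
For reference, a comparison like this could also be reproduced with scikit-image's built-in SSIM implementation rather than the hand-rolled sketch above (the file names below are hypothetical):

import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity as ssim

reference = np.asarray(Image.open("sonoma_reference.png").convert("L"))
upscaled = np.asarray(Image.open("sonoma_upscaled.png").convert("L"))

score = ssim(reference, upscaled, data_range=255)   # mean SSIM over the whole image
print(f"MSSIM: {score:.4f}")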

Conclusion

Despite being almost 20 years old, SSIM is still widely used for image quality assessment, especially when working with compressed images and when training advanced machine learning models, where SSIM serves as a crucial metric for evaluating degradations such as blurriness in reconstructed images.

Authors: Giorgio Micaletto, Andrea Procopio, Lorenzo Caputi