CrossScore: Towards Multi-View Image Evaluation and Scoring
ECCV 2024
University of Oxford
TLDR: This method evaluates an image by comparing it with multiple views of the same scene through cross-attention, eliminating the need for a pre-aligned ground truth image.
Application: Evaluate rendered images from novel view synthesis (NVS) applications where ground truth references are unavailable.
Abstract
We introduce a novel Cross-Reference image quality assessment method that fills a gap in the image assessment landscape, complementing the array of established evaluation schemes: Full-Reference metrics such as SSIM, No-Reference metrics such as NIQE, General-Reference metrics such as FID, and Multi-Modal-Reference metrics such as CLIPScore.
Utilising a neural network with a cross-attention mechanism and a unique data collection pipeline built on NVS optimisation, our method enables accurate image quality assessment without requiring ground truth references. By comparing a query image against multiple views of the same scene, our method addresses the limitations of existing metrics in novel view synthesis (NVS) and similar tasks where direct reference images are unavailable. Experimental results show that our method correlates closely with the full-reference metric SSIM while requiring no ground truth references.
Method
Our goal is to evaluate the quality of a query image, using a set of reference images that capture the same scene as the query image but from other viewpoints. From the NVS application perspective, the query image is often a rendered image with artefacts, and the reference images consist of the real captured images.
Network
We propose a network that takes a query image and a set of reference images and predicts a dense score map for the query image. Our network consists of three components (a minimal code sketch follows the list):
- an image encoder which extracts feature maps from input images;
- a cross-reference module that associates a query image with multi-view reference images; and
- a score regression head that regresses a CrossScore for each pixel of the query image.
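Below is a minimal PyTorch sketch of this three-component design. The module sizes, the simple convolutional encoder, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the authors' exact architecture.

```python
# A hedged sketch of the three-component network described above.
import torch
import torch.nn as nn

class CrossScoreSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # 1) Image encoder: extracts a coarse feature map per image
        #    (placeholder CNN; the paper's encoder details are omitted here).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=8, stride=8),
            nn.GELU(),
        )
        # 2) Cross-reference module: query tokens attend to reference tokens.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # 3) Score regression head: one scalar in [0, 1] per token, like SSIM.
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, query, refs):
        # query: (B, 3, H, W); refs: (B, N, 3, H, W)
        B, N = refs.shape[:2]
        q = self.encoder(query)                      # (B, D, h, w)
        h, w = q.shape[-2:]
        q = q.flatten(2).transpose(1, 2)             # (B, h*w, D) query tokens
        r = self.encoder(refs.flatten(0, 1))         # (B*N, D, h, w)
        r = r.flatten(2).transpose(1, 2).reshape(B, N * h * w, -1)
        fused, _ = self.cross_attn(q, r, r)          # query attends to all refs
        score = self.head(fused).reshape(B, h, w)    # coarse dense score map
        # Upsample to a per-pixel score map at the query resolution.
        return nn.functional.interpolate(
            score.unsqueeze(1), size=query.shape[-2:], mode="bilinear")
```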
Self-supervised Training
We leverage existing NVS systems and abundant multi-view datasets to generate SSIM maps for our training.
Specifically, we select Neural Radiance Field (NeRF)-style NVS systems as our data engine. Given a set of images, a NeRF recovers a neural representation of a scene by iteratively reconstructing the given image set with photometric losses.
By rendering images with the camera parameters of the original captured image set at multiple NeRF training checkpoints, we generate a large number of images containing various types of artefacts at various levels. From these renderings, we compute SSIM maps against the corresponding real captured images, which serve as our training objectives.
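The sketch below illustrates this data-generation loop. `nerf.load_checkpoint` and `nerf.render` are hypothetical placeholders for whatever NVS system serves as the data engine; the dense SSIM map computation uses scikit-image and produces the per-pixel training targets.

```python
# A hedged sketch of the self-supervised data-generation loop described above.
from skimage.metrics import structural_similarity

def generate_training_pairs(nerf, checkpoints, cameras, gt_images):
    """Yield (rendered_image, ssim_map) training pairs."""
    for ckpt in checkpoints:          # early checkpoints -> strong artefacts
        nerf.load_checkpoint(ckpt)    # hypothetical API
        for cam, gt in zip(cameras, gt_images):
            rendered = nerf.render(cam)  # hypothetical API, HxWx3 in [0, 1]
            # Dense SSIM map between the render and the real captured image;
            # full=True returns the per-pixel map alongside the mean score.
            _, ssim_map = structural_similarity(
                gt, rendered, channel_axis=-1, data_range=1.0, full=True)
            yield rendered, ssim_map.mean(axis=-1)  # HxW target score map
```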
Additional Results
Ablation: Enable and Disable Reference Images
Here, we show that our method effectively leverages reference views while evaluating a query image. With reference images enabled (ON), the predicted score map contains more detail than when they are disabled (OFF), in which case the model tends to assign a high score everywhere.
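The paper's exact mechanism for disabling references is not detailed here; one plausible way to reproduce the ON/OFF comparison with the `CrossScoreSketch` class above is to replace the reference set with copies of the query image, which is an assumption on our part.

```python
# Reproducing the ON/OFF ablation with the sketch network defined earlier.
# Feeding query copies as "references" is an assumed stand-in for OFF mode.
import torch

model = CrossScoreSketch().eval()
query = torch.rand(1, 3, 224, 224)
refs = torch.rand(1, 5, 3, 224, 224)              # five real reference views

with torch.no_grad():
    score_on = model(query, refs)                                      # ON
    score_off = model(query, query.unsqueeze(1).repeat(1, 5, 1, 1, 1)) # OFF
print(score_on.shape, score_off.shape)  # both (1, 1, 224, 224)
```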
Attention Weights Visualisation
We further show that our model indeed attends to related context in the reference images, as evidenced by the attention-map visualisations below.
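A hedged sketch of how such attention maps can be extracted, reusing the `CrossScoreSketch` class from the Method section: for one query token, read out its attention weights over all reference tokens and reshape them into a per-reference-view heatmap. The grid sizes follow that sketch and are illustrative, not the paper's exact setup.

```python
# Extracting head-averaged cross-attention weights for visualisation.
import torch

model = CrossScoreSketch().eval()
query = torch.rand(1, 3, 224, 224)
refs = torch.rand(1, 4, 3, 224, 224)

with torch.no_grad():
    q = model.encoder(query).flatten(2).transpose(1, 2)      # (1, 784, D)
    r = model.encoder(refs.flatten(0, 1)).flatten(2).transpose(1, 2)
    r = r.reshape(1, -1, q.shape[-1])                        # (1, 4*784, D)
    # need_weights=True returns head-averaged attention: (1, 784, 4*784)
    _, attn = model.cross_attn(q, r, r, need_weights=True)

token = 14 * 28 + 14                         # query token near image centre
heatmaps = attn[0, token].reshape(4, 28, 28) # one 28x28 map per reference view
```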
Acknowledgement
This research is supported by an ARIA research gift grant from Meta Reality Lab. We thank Shangzhe Wu, Tengda Han, and Zihang Lai for insightful discussions, and Michael Hobley for proofreading.
BibTeX
@inproceedings{wang2024crossscore,
  title     = {CrossScore: Towards Multi-View Image Evaluation and Scoring},
  author    = {Zirui Wang and Wenjing Bian and Victor Adrian Prisacariu},
  booktitle = {ECCV},
  year      = {2024}
}