CrossScore: Towards Multi-View Image Evaluation and Scoring

Zirui Wang1    Wenjing Bian1    Omkar Parkhi2    Yuheng Ren2    Victor Adrian Prisacariu1

1University of Oxford     2Meta Reality Lab

arXiv Code (Coming Soon)

TLDR: This method evaluates an image by comparing it with multiple views of the same scene through cross-attention, eliminating the need for a pre-aligned ground truth image.

Application: Evaluate rendered images from novel view synthesis (NVS) applications where ground truth references are unavailable.

We introduce an image assessment method that examines query images by referencing multiple views of the same scene, producing results termed CrossScore maps. Our results show that CrossScore is closely correlated with SSIM across diverse datasets, without requiring pre-aligned ground truth images. Colour coding: red represents the highest score, followed by orange, green, and blue in order of decreasing score.


We introduce a novel Cross-Reference image quality assessment method that effectively fills the gap in the image assessment landscape, complementing the array of established evaluation schemes -- ranging from Full-Reference metrics like SSIM, No-Reference metrics such as NIQE, to General-Reference metrics including FID, and Multi-Modal-Reference metrics, e.g. CLIPScore.

We propose a novel cross-reference (CR) image quality assessment (IQA) scheme, which evaluates a query image using multiple unregistered reference images that are captured from different viewpoints. This approach sets a new research trajectory apart from conventional IQA schemes such as full-reference (FR), general-reference (GR), no-reference (NR), and multi-modal-reference (MMR).

Utilising a neural network with a cross-attention mechanism and a unique data collection pipeline from NVS optimisation, our method enables accurate image quality assessment without requiring ground truth references. By comparing a query image against multiple views of the same scene, our method addresses the limitations of existing metrics in novel view synthesis (NVS) and similar tasks where direct reference images are unavailable. Experimental results show that our method is closely correlated with the full-reference metric SSIM, while not requiring ground truth references.


Our goal is to evaluate the quality of a query image, using a set of reference images that capture the same scene as the query image but from other viewpoints. From the NVS application perspective, the query image is often a rendered image with artefacts, and the reference images consist of the real captured images.

Method Overview. Left: Our NVS-based data engine that supplies query and reference images along with SSIM maps to drive the self-supervised training of our model. Right: Our model that takes a query image and a set of reference images as input and predicts a score map for the query image.


We propose a network that takes a query image and a set of reference images and predicts a dense score map for the query image. Our network consists of three components:

  1. an image encoder which extracts feature maps from input images;
  2. a cross-reference module that associates a query image with multi-view reference images; and
  3. a score regression head that regresses a CrossScore for each pixel of the query image.

In practice, we adapt a pretrained DINOv2-small model as the image encoder, a Transformer Decoder for the cross-reference module, and a shallow MLP for the score regression head.
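To make the role of the cross-reference module concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention, in which each query-image token attends over tokens gathered from all reference images, followed by a toy score head. It is an illustration under stated assumptions rather than the released model: it omits the learned query/key/value projections, the multi-head and multi-layer structure of a Transformer Decoder, and the DINOv2 encoder, and it replaces the shallow MLP with a single linear layer plus sigmoid. The function names `cross_attention` and `score_head` and all numeric values are hypothetical.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_tokens, ref_tokens):
    """Single-head cross-attention without learned projections (assumption).

    query_tokens: feature vectors from the query image.
    ref_tokens:   feature vectors pooled from all reference images.
    Returns the attended features and the attention weights per query token.
    """
    dim = len(query_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in query_tokens:
        # Similarity of this query token to every reference token.
        w = softmax([dot(q, k) * scale for k in ref_tokens])
        # Weighted sum of reference tokens (values == keys in this sketch).
        out = [sum(w_j * v[d] for w_j, v in zip(w, ref_tokens)) for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def score_head(feature, head_weights, bias=0.0):
    """Toy linear head with a sigmoid; stands in for the shallow MLP."""
    z = dot(feature, head_weights) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Random stand-in features: 3 query tokens, 10 reference tokens, 4 channels.
random.seed(0)
dim = 4
query_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
ref_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(10)]

attended, attn = cross_attention(query_tokens, ref_tokens)
scores = [score_head(f, [0.5] * dim) for f in attended]
```

Each row of `attn` sums to one over the reference tokens, which is the kind of quantity that can be overlaid on reference images as an attention map.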

Self-supervised Training

We leverage existing NVS systems and abundant multi-view datasets to generate SSIM maps for our training.

Specifically, we select Neural Radiance Field (NeRF)-style NVS systems as our data engine. Given a set of images, a NeRF recovers a neural representation of a scene by iteratively reconstructing the given image set with photometric losses.

By rendering images with the camera parameters from the original captured image set at multiple NeRF training checkpoints, we generate a large number of images that contain various types of artefacts at various levels. From these renders, we compute SSIM maps between rendered images and corresponding real captured images, which serve as our training objectives.
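As a rough illustration of how such a training target can be produced, the sketch below computes a dense SSIM map between a rendered image and its real counterpart using the standard SSIM formula. The window shape (a uniform 3x3 window evaluated at valid positions only), the grayscale input, and the function name `ssim_map` are assumptions made for brevity; the page does not specify the exact SSIM configuration used for training.

```python
def ssim_map(img_x, img_y, win=3, data_range=1.0):
    """Dense SSIM over all valid win x win windows with uniform weighting.

    img_x, img_y: 2-D lists of grayscale intensities with identical shape.
    Returns a 2-D list of shape (H - win + 1, W - win + 1).
    """
    # Standard stabilising constants from the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    h, w = len(img_x), len(img_x[0])
    n = win * win
    out = []
    for i in range(h - win + 1):
        row = []
        for j in range(w - win + 1):
            xs = [img_x[i + a][j + b] for a in range(win) for b in range(win)]
            ys = [img_y[i + a][j + b] for a in range(win) for b in range(win)]
            mu_x = sum(xs) / n
            mu_y = sum(ys) / n
            var_x = sum((x - mu_x) ** 2 for x in xs) / n
            var_y = sum((y - mu_y) ** 2 for y in ys) / n
            cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
            num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
            den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
            row.append(num / den)
        out.append(row)
    return out


# Toy 5x5 "real" image (a smooth gradient) and a "rendered" copy with one
# corrupted pixel standing in for a rendering artefact.
real = [[(i + j) / 8 for j in range(5)] for i in range(5)]
rendered = [row[:] for row in real]
rendered[0][0] = 1.0

target = ssim_map(rendered, real)       # low where the artefact is
identity = ssim_map(real, real)         # approximately 1 everywhere
```

Windows that cover the corrupted pixel receive a low score, while windows far from it stay close to 1, which is the per-location supervision signal a score map can be regressed against.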

Additional Results

Evaluating images rendered from a popular NVS method (Gaussian-Splatting) using CrossScore and SSIM. CrossScore is highly correlated with SSIM, while not requiring ground truth images.
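One simple way to quantify agreement between two sets of scores is a Pearson correlation coefficient over paired per-image (or per-pixel) values. The page does not state which correlation measure is used, so the choice of Pearson here is an assumption, and the numbers below are toy placeholders rather than results from the method.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Placeholder values only, to show the call; not measurements.
toy_ssim = [0.2, 0.4, 0.6, 0.8]
toy_predicted = [0.1, 0.3, 0.5, 0.7]
r = pearson(toy_predicted, toy_ssim)
```

A value near 1 indicates that the predicted scores rank and scale images much as SSIM does, while a value near 0 indicates no linear relationship.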

Ablation: Enable and Disable Reference Images

Here, we show that our method effectively leverages reference views while evaluating a query image. With reference images enabled (ON), the score map predicted by our method contains more details than when reference images are disabled (OFF), where the model tends to assign a high score everywhere.

Ablation study on the importance of reference images.

Attention Weights Visualisation

We further illustrate that our model is indeed checking related context in reference images, as evidenced by the visualisation of attention maps below.

Attention weights visualisation of our model. Top left: a query image with a region of interest (centre of image) highlighted with a magenta box. Right column: three reference images from our cross-reference set with attention maps overlaid. The attention maps illustrate the attention that is paid to predicting image quality at the query region. Red and blue denote high and low attention weights respectively. Note that we use 5 reference images in our experiment, but only 3 are shown due to space constraints. Bottom: Predicted CrossScore map and SSIM map. Red and blue denote high and low quality image regions respectively.


This research is supported by an ARIA research gift grant from Meta Reality Lab. We gratefully thank Shangzhe Wu, Tengda Han, and Zihang Lai for insightful discussions, and Michael Hobley for proofreading.


    title={CrossScore: Towards Multi-View Image Evaluation and Scoring},
    author={Zirui Wang and Wenjing Bian and Omkar Parkhi and Yuheng Ren and Victor Adrian Prisacariu},
    journal={arXiv preprint arXiv:2404.14409},