VLEO-Bench

A Comprehensive Benchmark of Vision-Language Models for Earth Observation (EO) Data


Institute for Data, Systems, and Society, Massachusetts Institute of Technology

†Correspondence to: chenhui5@mit.edu
Overview of VLEO-Bench

Overview of the VLEO-Bench dataset. VLEO-Bench presents 11 challenging scenarios motivated by real-world applications of Earth Observation (EO) data. We aim to test the scene understanding, localization, counting, and change detection abilities of state-of-the-art instruction-following VLMs.

🔔News

🔥[2024-01-22]: Initial release of our benchmark!

Introduction

Large Vision-Language Models (VLMs) have demonstrated impressive performance on complex tasks involving visual input with natural language instructions. However, it remains unclear to what extent capabilities on natural images transfer to Earth observation (EO) data, which are predominantly satellite and aerial images less common in VLM training data. In this work, we propose a comprehensive benchmark to gauge the progress of VLMs toward being useful tools for EO data by assessing their abilities on scene understanding, localization and counting, and change detection tasks. Motivated by real-world applications, our benchmark includes scenarios like urban monitoring, disaster relief, land use, and conservation. We discover that, although state-of-the-art VLMs like GPT-4V possess extensive world knowledge that leads to strong performance on open-ended tasks like location understanding and image captioning, their poor spatial reasoning limits usefulness on object localization and counting tasks.

VLEO-Bench

Overview

In this paper, we provide an application-focused evaluation of instruction-following VLMs like GPT-4V for different capabilities in EO, including location understanding, zero-shot remote sensing scene understanding, world knowledge, text-grounded object localization and counting, and change detection. These capabilities provide the EO community with pathways for impact in real-world application areas, including urban monitoring, disaster relief, land use, and conservation.


Desired Capabilities for EO Data. To build an EO benchmark for VLMs, we focus on three broad categories of capabilities in our initial release: scene understanding, localization and counting, and change detection. Within each category, we construct evaluations based on applications ranging from animal conservation to urban monitoring. Our goals are to (1) evaluate the performance of existing VLMs, (2) provide insights into prompting techniques suitable for repurposing existing VLMs to EO tasks, and (3) implement an interface of data and models for flexible benchmark updates and evaluations of future VLMs. Our categories and tasks are:

  • Scene Understanding: To evaluate how VLMs combine high-level information extracted from images with latent knowledge learned through language modeling, we assemble the following datasets:
    1. a new aerial landmark recognition dataset to test the model's ability to recognize and geolocate landmarks in the United States;
    2. the RSICD dataset to evaluate the model's ability to generate open-ended captions for Google Earth images;
    3. the BigEarthNet dataset to probe the model's ability to identify land cover types in medium-resolution satellite images; and
    4. the fMoW-WILDS and PatternNet datasets to assess the model's ability to classify land use in high-resolution satellite images.
  • Localization & Counting: To evaluate whether VLMs can extract fine-grained information about a specific object and understand its spatial relationship to other objects, we assemble datasets for three tasks:
    1. the DIOR-RSVG dataset to assess Referring Expression Comprehension (REC) abilities, in which the model is required to localize objects based on their natural language descriptions (see the parsing-and-scoring sketch after this list);
    2. the NEON-Tree, COWC, and xBD datasets to assess the model's ability to count small objects such as cluttered trees, cars, and buildings in aerial and satellite images;
    3. the aerial animal detection dataset to gauge the model's ability to count animal populations in tilted aerial images taken with handheld cameras.
  • Change Detection: To evaluate if VLMs can identify differences between multiple images and complete user-specified tasks based on such differences, we repurpose the xBD dataset. We show the model two high-resolution images taken before and after a natural disaster and ask it to assign damaged buildings to qualitative descriptions of damage categories.
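
For illustration, below is a minimal sketch, ours rather than the benchmark's exact prompt or parser, of how a bounding box could be extracted from a VLM's free-form text reply and scored against the ground truth with intersection-over-union (IoU). The bracketed pixel-coordinate response format, function names, and example strings are assumptions.

    import re
    from typing import Optional, Tuple

    Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

    def parse_box(reply: str) -> Optional[Box]:
        """Extract the first '[x_min, y_min, x_max, y_max]' pattern from a free-form reply.
        Assumes the prompt asked the model to answer with a bracketed pixel-coordinate box."""
        m = re.search(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]", reply)
        if m is None:
            return None  # no parseable box in the answer
        return tuple(float(g) for g in m.groups())

    def iou(pred: Box, gt: Box) -> float:
        """Intersection-over-union between two axis-aligned boxes."""
        ix_min, iy_min = max(pred[0], gt[0]), max(pred[1], gt[1])
        ix_max, iy_max = min(pred[2], gt[2]), min(pred[3], gt[3])
        inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
        union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
                 + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
        return inter / union if union > 0 else 0.0

    # Hypothetical reply to "Localize the object described as 'the baseball field next to the road'."
    reply = "The baseball field is located at [120, 45, 310, 200]."
    pred = parse_box(reply)
    gt = (110.0, 50.0, 300.0, 210.0)  # made-up ground-truth box
    print(f"IoU = {iou(pred, gt):.2f}" if pred is not None else "No box found: score as IoU 0")

A reply with no parseable box can simply be scored as IoU 0, so a refusal or a purely verbal answer counts against the model.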

We note that a number of capabilities desired for EO data remain unattainable by current-generation VLMs due to their inability to ingest multi-spectral, non-optical, or multi-temporal images. This is unlikely to be addressed by the vision community while its focus remains on natural images. Furthermore, available VLMs do not yet perform image segmentation, although we expect this to change in the near future.

Strengths and weaknesses of GPT-4V

GPT-4V has scene understanding abilities but cannot accurately count or localize objects. Only part of the user prompt and model response is shown for illustration.

Experiment Results

Result Overview

Below, we summarize insights from our evaluations, with a focus on GPT-4V, as it is generally the best-performing VLM across Earth observation tasks. We elaborate on the results in Sections Scene Understanding, Localization & Counting, and Change Detection.

  • Scene Understanding:
    1. On our new aerial landmark recognition task, GPT-4V achieves an overall accuracy of 0.67, surpassing open models by a large margin and demonstrating its comprehensive world knowledge. There appear to be regional disparities, with GPT-4V generally performing better in coastal states. In addition, although GPT-4V sometimes generates sophisticated reasoning paths, the reasoning can be incorrect despite a correct final answer.
    2. On RSICD image captioning, GPT-4V achieves a RefCLIPScore of 0.75, which measures both image-text semantic similarity and caption-reference similarity. Although GPT-4V does not achieve high similarity between its generated captions and the reference captions, our qualitative assessment is that it produces even more detailed captions than the human annotators of RSICD.
    3. On land cover/land use classification tasks, GPT-4V performance varies with image resolution, label ambiguity, and label granularity: the average F1-score is 0.19 on fMoW-WILDS, 0.71 on PatternNet, and 0.38 on BigEarthNet (an F1 computation sketch follows this list). The high performance on PatternNet can be attributed to its high image resolution and disambiguated labels, while the low performance on fMoW-WILDS is largely due to ambiguous labels, which we discuss in Section Land Cover/Land Use Classification.
  • Localization & Counting:
    1. On DIOR-RSVG object localization, GPT-4V obtains a mean intersection-over-union (IoU) of 0.16; only 7.6% of the test images have an IoU > 0.5, while a model that specializes in outputting bounding boxes achieves a mean IoU of 0.68.
    2. While GPT-4V achieves moderate accuracies on the COWC vehicle counting and xBD building counting tasks, it fails on NEON-Tree counting and aerial animal detection.
  • Change Detection: On xBD change detection, GPT-4V fails to count and categorize the damaged buildings, particularly those in the "destroyed" category. Although it can count the number of buildings before a disaster with moderate accuracy, it systematically fails to assess building damage by contrasting the before and after images. This systematic failure makes it unusable for disaster relief applications that require counting abilities.
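
As a reference point for the classification numbers above, the sketch below shows how per-class and macro-averaged F1 can be computed for multi-label land cover predictions with scikit-learn. The toy label space and predictions are made up, and treating the reported averages as macro F1 over classes is our assumption; the benchmark may aggregate differently (e.g., micro or per-sample averaging).

    import numpy as np
    from sklearn.metrics import f1_score

    # Toy multi-label setup: rows are images, columns are land cover classes
    # (a simplified BigEarthNet-style label space); entries are binary indicators.
    classes = ["forest", "water", "urban", "agriculture"]
    y_true = np.array([[1, 0, 0, 1],
                       [0, 1, 0, 0],
                       [1, 0, 1, 0]])
    # Hypothetical labels parsed from the VLM's text responses for the same images.
    y_pred = np.array([[1, 0, 0, 0],
                       [0, 1, 1, 0],
                       [1, 0, 1, 0]])

    per_class = f1_score(y_true, y_pred, average=None, zero_division=0)   # one F1 per class
    macro = f1_score(y_true, y_pred, average="macro", zero_division=0)    # unweighted mean over classes
    for name, score in zip(classes, per_class):
        print(f"{name:12s} F1 = {score:.2f}")
    print(f"macro-average F1 = {macro:.2f}")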

Leaderboard

Coming soon.

BibTeX


      @article{zhang2024vleobench,
        title   = {Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data},
        author  = {Chenhui Zhang and Sherrie Wang},
        year    = {2024},
        journal = {arXiv preprint arXiv:2401.17600}
      }