
GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI

Paolo Fraccaro
Alexandre Lacoste
Naomi Simumba
Nils Lehmann

By IBM, ServiceNow, and the AI Alliance Climate & Sustainability Working Group

A New Era for Geospatial AI

Geospatial Foundation Models (GeoFMs) are large-scale AI models trained on diverse Earth observation data to support multiple geospatial tasks. They are transforming how we understand and manage our planet. But as these models grow in complexity and capability, one question becomes critical:

How do we measure progress in a consistent, meaningful way?

Current benchmarking efforts face limitations in scope, licensing, reproducibility, and usability, hindering adoption and fair comparison. Today, ahead of the full release, we unveil the first results on GEO-Bench-2, a benchmark designed to evaluate GeoFMs across diverse tasks and datasets. With GEO-Bench-2 we aim to provide a shared framework that accelerates innovation and collaboration in geospatial AI.

What Makes GEO-Bench-2 Unique?

GEO-Bench-2 is more than a dataset collection. It’s a comprehensive evaluation ecosystem built to answer open research questions and guide the next generation of GeoFMs.

1. Diverse Dataset Aggregation

We’ve aggregated 19 datasets, organized into 9 subsets, each targeting specific capabilities aligned with critical research challenges. Every dataset has been carefully reviewed for permissive licensing and high data quality, and sub-sampled where necessary to reduce the computational resources needed for evaluation.

Table 1 – GEO-Bench-2 datasets and capabilities.

2. Flexible Evaluation Protocol

Our protocol ensures fair, comparable, and meaningful evaluations, regardless of model architecture or pre-training strategy. It accounts for both hyperparameter optimization (HPO) and repeated experiments, allowing models to adapt to each specific task while accounting for variability across runs. This flexibility fosters innovation while maintaining scientific rigor. To ensure an apples-to-apples comparison when aggregating across datasets, the protocol also recommends dataset-wise min-max normalization, as well as a bootstrapped interquartile mean to exclude outliers and account for uncertainty.
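To make the aggregation recipe concrete, here is a minimal NumPy sketch of dataset-wise min-max normalization followed by a bootstrapped interquartile mean (IQM). The normalization bounds, bootstrap settings, and confidence interval shown are illustrative assumptions, not the official GEO-Bench-2 implementation.

```python
import numpy as np

def min_max_normalize(score, low, high):
    # Map a raw dataset score onto [0, 1] using dataset-specific bounds
    # (assumed bounds; GEO-Bench-2 defines its own per-dataset ranges).
    return (score - low) / (high - low)

def interquartile_mean(values):
    # Mean of the values lying between the 25th and 75th percentiles,
    # which discards outliers at both ends.
    values = np.sort(np.asarray(values, dtype=float))
    q1, q3 = np.percentile(values, [25, 75])
    return values[(values >= q1) & (values <= q3)].mean()

def bootstrapped_iqm(values, n_boot=1000, seed=0):
    # Resample with replacement to get an IQM point estimate plus a
    # 95% interval that reflects uncertainty across runs and datasets.
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    boots = [
        interquartile_mean(rng.choice(values, size=len(values), replace=True))
        for _ in range(n_boot)
    ]
    return float(np.mean(boots)), np.percentile(boots, [2.5, 97.5])

# Example: aggregate already-normalized per-dataset scores for one model.
per_dataset_scores = [0.71, 0.64, 0.93, 0.58, 0.88]
point_estimate, (ci_low, ci_high) = bootstrapped_iqm(per_dataset_scores)
```

The appeal of this kind of aggregation is that the IQM is less sensitive to a single unusually good or bad dataset than a plain mean, while the bootstrap turns one aggregate number into an estimate with an uncertainty band.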

3. Extensive Baseline Experiments

Following the GEO-Bench-2 protocol, we ran large-scale experiments using TerraTorch, totalling more than 15,000 runs across popular open GeoFMs, and compared them to the latest open computer vision approaches (see Table 2) to uncover:

  • Which models excel at which capabilities?
  • Where do they fall short?
  • What challenges lie ahead?

Table 2 – Models considered in the GEO-Bench-2 experiments. Optical indicates use of EO data beyond RGB. Backbone params is an approximate parameter count. N: number of training samples, in millions of images; Res: spatial resolution in meters.

Table 3 – Model rankings across all the different GEO-Bench-2 capabilities.

Table 3 shows results using a full finetuning approach. Clay V1, which uses a smaller patch size to retain more information at a higher computational cost, takes the lead in the Core capability, followed by ConvNext-based models. As expected, general computer vision models and those focusing on RGB-only data in pretraining (i.e. DINO V3 and ConvNext) top the <10m resolution and RGB/NIR capabilities. In contrast, EO-specific models such as TerraMind and Prithvi-EO-2.0, pretrained for this type of data, lead in multispectral-dependent tasks. These are critical for applications like agriculture, environmental monitoring, and disaster response, where RGB data alone is often insufficient.

4. A Public Leaderboard

Progress thrives on transparency. Our leaderboard tracks model performance on GEO-Bench-2, enabling the community to benchmark, compare, and improve. The leaderboard lets users dive into specific capabilities and provides settings to explore both full finetuning and frozen-encoder approaches.

Figure 1 – GEO-Bench-2 leaderboard outline.

5. Integration with ready-to-use tools

GEO-Bench-2 comes as its own Python package, available on GitHub, with all datasets stored in TACO format and soon to be available on the AI Alliance Hugging Face page. We have also built a tight integration with TerraTorch, the most popular tool for finetuning and experimenting with GeoFMs. This makes it possible to launch the benchmark, including HPO and repeated experiments, over the GEO-Bench-2 datasets from a single simple config, as sketched below. While this makes experimentation easier, it is not a hard requirement for submitting scores to the leaderboard, and users are free to use their favorite tools.
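As a purely illustrative sketch of what a config-driven run might look like, the snippet below loops over the datasets listed in a config and records one score per dataset. The config keys and the `evaluate_on_dataset` helper are hypothetical placeholders, not the released GEO-Bench-2 or TerraTorch API.

```python
# Hypothetical sketch: config keys and function names are illustrative
# placeholders, not the actual GEO-Bench-2 / TerraTorch interfaces.
import yaml

def evaluate_on_dataset(model_name, dataset, hpo_trials, seeds):
    # Placeholder for an HPO + repeated fine-tuning/evaluation run;
    # in practice this step would hand off to the fine-tuning tooling.
    return 0.0

def run_benchmark(config_path):
    # Drive a benchmark sweep over every dataset listed in one config.
    with open(config_path) as f:
        config = yaml.safe_load(f)

    scores = {}
    for dataset in config["datasets"]:
        scores[dataset] = evaluate_on_dataset(
            model_name=config["model"],
            dataset=dataset,
            hpo_trials=config.get("hpo_trials", 16),
            seeds=config.get("seeds", [0, 1, 2]),
        )
    return scores
```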

Why It Matters

By providing a shared yardstick, GEO-Bench-2 empowers researchers and industry alike:

  • Researchers gain clarity on what works and what doesn’t.
  • Companies can benchmark solutions before deployment.
  • Society benefits from better tools for climate resilience, disaster response, and sustainability.

Join the Movement

GEO-Bench-2 is an open initiative led by IBM and ServiceNow in collaboration with MBZUAI, NASA, ESA Φ-lab, the Technical University of Munich, Arizona State University, and Clark University, under the AI Alliance Climate & Sustainability Working Group. The full release of the datasets and leaderboard on Hugging Face will follow soon, alongside a comprehensive technical report that explains how we obtained our results, presents a number of ablation studies, and discusses model capabilities in relation to parameter count and pretraining dataset size. Stay tuned and help us shape the future of geospatial AI!

“By providing a shared yardstick, GEO-Bench-2 helps researchers, companies, and society navigate toward a smarter, more sustainable planet.”
