Waleed Alzuhair, flickr

Big Geodata Newsletter, May 2024

Greetings from the Big Geodata Newsletter!

In this issue you will find information on the new ML-ready dataset standards Croissant and Major TOM, GPU-accelerated data analytics with NVIDIA RAPIDS cuDF, and a new analysis-ready platform for Earth Video Cubes.

Max Gabrielsson from DuckDB Labs will introduce DuckDB, a high-performance vectorized database system, and its Spatial Extension during our next Big Geodata Talk. Don't miss the chance to get a first-hand walkthrough of the internals that make DuckDB special, as well as some of the challenges encountered when adapting it for geospatial processing. Register now!

Ir. Jorges Nofulla, who graduated from ITC in 2023, shared his experience of using our Geospatial Computing Platform for his MSc thesis project on Deep Learning-based Change Detection and Classification for Airborne Laser Scanning Data, supervised by Dr. Sander Oude Elberink and Dr. Michael Yang from the Department of Earth Observation Science (EOS). Don't miss the Big Geodata Story!

Happy reading! 

You can access the previous issues of the newsletter on our web portal. If you find the newsletter useful, please share the subscription link below with your network.

Major TOM-Core: Largest ML-ready Sentinel-2 dataset!

Image credits: Major TOM, 2024

EO-based deep learning models are often hard to make interoperable and reproducible, mainly because of the nature of the underlying EO datasets. Differing formats, extents, overlaps, and data structures also make it difficult to access or combine multiple datasets within the same model. Major TOM (Terrestrial Observation Metaset), a community-oriented project from the ESA Φ-lab, addresses this challenge with a framework for defining high-quality baseline EO datasets for AI research. It primarily outlines a simple geographical indexing system based on a 10 km grid and a comprehensive metadata structure that facilitates combining multiple datasets from different sources. Building EO datasets for deep learning in such a format opens up the possibility of bias evaluation and makes models easier to adapt to new domains and sensors.
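To give a feel for what indexing on a roughly 10 km grid can look like, here is a minimal Python sketch. It is not the Major TOM grid definition: the function name `grid_cell`, the spherical Earth radius, and the row and column numbering are all assumptions made purely for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth with mean radius


def grid_cell(lat, lon, cell_km=10.0):
    """Return a hypothetical (row, col) index of the ~cell_km cell containing a point.

    Rows are equal-height latitude bands counted from the south pole.
    Each row gets its own number of columns, so cells stay roughly
    cell_km wide even as the meridians converge towards the poles.
    """
    # Number of latitude bands needed to cover pole to pole
    n_rows = math.ceil(math.pi * EARTH_RADIUS_KM / cell_km)
    lat_step = 180.0 / n_rows
    row = min(int((lat + 90.0) / lat_step), n_rows - 1)

    # Length of the parallel through the centre of this row
    centre_lat = -90.0 + (row + 0.5) * lat_step
    parallel_km = 2 * math.pi * EARTH_RADIUS_KM * math.cos(math.radians(centre_lat))

    # Number of columns in this row, counted eastwards from 180 degrees west
    n_cols = max(1, math.ceil(parallel_km / cell_km))
    lon_step = 360.0 / n_cols
    col = min(int((lon + 180.0) / lon_step), n_cols - 1)
    return row, col


# Example: a point near Enschede, the Netherlands
print(grid_cell(52.22, 6.89))
```

Because every dataset that adopts the same grid assigns the same index to the same location, samples from different sources can be joined on the cell index rather than on overlapping footprints.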

Following this, Major TOM-Core was launched as one such ML-ready Sentinel-2 dataset (45 TB) hosted on the Hugging Face Hub. Users can filter through the 5000 metadata parquet files and either download or stream the dataset as illustrated in the demo notebooks on the project’s GitHub repository.

Croissant: A High-Level Format for ML Datasets


Image credits: Andrea Viliotti, 2024

The common challenge of handling a wide variety of data representations is faced not only by the EO researchers at the ESA Φ-lab, but also by many research groups focusing on Machine Learning in other domains. To address this challenge and to promote interoperability and reproducibility, the MLCommons community working group, together with multiple other contributors, recently released Croissant, a high-level format for discovering and working seamlessly with machine learning datasets. It combines metadata, resource file descriptions, data structure descriptions and ML semantics such as defining train-test sets. Built on earlier schema.org and DCAT standards, the format is designed to be extensible and relevant for multiple domains and platforms. The Croissant Responsible AI vocabulary includes concepts of biases, fairness and use of human labelling. Croissant-defined datasets are supported by popular ML frameworks and can be created or edited using either the mlcroissant Python library or a visual editor.
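The layers described above can be sketched as a small JSON-LD document built in Python. This is an abridged, hypothetical example: the dataset name, file name and URL are placeholders, the full "@context" block is left out, and the key names follow our reading of the Croissant 1.0 format, so please check them against the official specification before relying on them.

```python
import json

# Hypothetical, abridged Croissant-style description of a tiny CSV dataset.
metadata = {
    # Dataset-level metadata
    "@type": "sc:Dataset",
    "name": "toy_land_cover",  # placeholder name
    "description": "Toy table of land-cover labels per image tile.",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    # Resource layer: the files that make up the dataset
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "labels.csv",
            "contentUrl": "https://example.org/labels.csv",  # placeholder URL
            "encodingFormat": "text/csv",
        }
    ],
    # Structure layer: how records and fields map onto the files
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "labels",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "labels/tile_id",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "labels.csv"},
                        "extract": {"column": "tile_id"},
                    },
                },
                {
                    "@type": "cr:Field",
                    "@id": "labels/land_cover",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "labels.csv"},
                        "extract": {"column": "land_cover"},
                    },
                },
            ],
        }
    ],
}

# Simple consistency check: every field must point at a declared file
file_ids = {f["@id"] for f in metadata["distribution"]}
fields = metadata["recordSet"][0]["field"]
unresolved = [
    f["@id"] for f in fields if f["source"]["fileObject"]["@id"] not in file_ids
]

croissant_json = json.dumps(metadata, indent=2)
print(croissant_json)
```

ML semantics such as train-test splits and the Responsible AI vocabulary are expressed through additional properties that this sketch omits.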

As the gap between the larger ML community and EO-based ML research narrows, standardizing the large volumes of EO data becomes imperative. Members of NASA's Interagency Implementation and Advanced Concepts Team (IMPACT) are working on a Geo-Croissant extension to include specific characteristics such as spatial reference information, geographical biases and sampling strategy.

150x Speed Boost to pandas with RAPIDS cuDF!

Image credits: NVIDIA, 2024

NVIDIA has dramatically improved the performance of its data processing library RAPIDS cuDF, which now accelerates pandas data processing by nearly 150 times without requiring any code changes. Revealed at GTC 2024, this enhancement lets data scientists and analysts use GPU acceleration while keeping the familiar pandas workflow on both CPUs and GPUs. Operations run on the GPU where they are supported and automatically fall back to the CPU otherwise, so existing pandas code keeps working.
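As a small illustration of what "no code changes" means in practice, the script below is ordinary pandas code with nothing cuDF-specific in it. The data and column names are made up for the example. On a machine with a supported NVIDIA GPU and cuDF installed, the same unchanged script can be started through the cuDF pandas accelerator, as noted in the comments.

```python
# Plain pandas code: nothing below refers to cuDF.
# To try GPU acceleration, run the unchanged script with
#     python -m cudf.pandas this_script.py
# or, in a notebook, run `%load_ext cudf.pandas` before importing pandas.
import pandas as pd

# Hypothetical toy data: one vegetation index value per observation
df = pd.DataFrame(
    {
        "tile": ["A", "A", "B", "B", "B"],
        "ndvi": [0.2, 0.4, 0.6, 0.8, 0.7],
    }
)

# A typical group-and-aggregate step
mean_ndvi = df.groupby("tile")["ndvi"].mean()
print(mean_ndvi)
```

On a dataset this small no speed-up should be expected; the benefit is reported for large tables, where groupby, join and sort operations dominate the run time.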

For further details, read the full article on RAPIDS cuDF on NVIDIA's official blog.

Upcoming Meetings

Recent Releases

The "Big" Picture

Image credits: Li et al., 2024 

The Earth Video Cube (EVC) is a spatiotemporal data cube that facilitates seamless integration of diverse data sources, including satellite and UAV imagery, enabling comprehensive cross-source analysis. The system's capacity for rapid data processing and its potential for future AI integration make it a valuable tool for real-time decision-making and environmental monitoring. The design of the EVC and its implementation of Analysis Ready Video Data simplify the processing of vast video datasets. By providing a structured framework for categorizing and analyzing video content at frame, object, trajectory, and event levels, EVC allows for the efficient transformation of raw video data into actionable insights.

Li, Z., Cao, Z., Yue, P., and Zhang, C. (2024). Earth Video Cube: A Geospatial Data Cube for Multisource Earth Observation Video Management and Analysis. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17, 4986-5000. doi:10.1109/JSTARS.2024.3358342

CRIB News
New Cloud Infrastructure

We are significantly expanding our on-premises cloud infrastructure using OpenStack. This upgrade will introduce more advanced computational resources, including virtual machines (VMs) and container technologies. Scheduled for availability by June, these enhancements are designed to provide our staff and students with robust, scalable tools to support academic research projects and educational needs. Stay tuned, and if you want to learn more, register for our Geospatial Computing Platform Users Meeting on 12 June!