Greetings from the Big Geodata Newsletter!
In this issue you will find information on the new ML-ready dataset standards Croissant and Major TOM, GPU-accelerated data analytics with NVIDIA RAPIDS cuDF, and a new analysis-ready platform for Earth Video Cubes.
Max Gabrielsson from DuckDB Labs will introduce DuckDB, a high-performance vectorized database system, and its Spatial Extension during our next Big Geodata Talk. Don't miss the chance to get a first-hand walkthrough of the internals that make DuckDB special, as well as some of the challenges encountered when adapting it for geospatial processing. Register now!
Ir. Jorges Nofulla, who graduated from ITC in 2023, shared his experience of using our Geospatial Computing Platform for his MSc thesis project on Deep Learning-based Change Detection and Classification for Airborne Laser Scanning Data, supervised by Dr. Sander Oude Elberink and Dr. Michael Yang from the Department of Earth Observation Science (EOS). Don't miss the Big Geodata Story!
Happy reading!
You can access the previous issues of the newsletter on our web portal. If you find the newsletter useful, please share the subscription link below with your network.
Major TOM-Core: Largest ML-ready Sentinel-2 Dataset!
Image credits: Major TOM, 2024
EO-based deep learning models struggle to be interoperable and reproducible, mainly because of the nature of the underlying EO datasets. Differing formats, extents, overlaps, and data structures also make it difficult to access or combine multiple datasets within the same model. Major TOM (Terrestrial Observation Metaset), a community-oriented project from the ESA Φ-lab, tries to address this challenge with a framework for defining high-quality baseline EO datasets for AI research. It primarily outlines a simple geographical indexing system based on a 10 km grid and a comprehensive metadata structure that facilitates combining multiple datasets from different sources. Building the EO datasets used in deep learning models in such a format opens the possibility of bias evaluation and makes models easier to adapt to new domains and sensors.
Following this, Major TOM-Core was launched as one such ML-ready Sentinel-2 dataset (45 TB) hosted on the Hugging Face Hub. Users can filter through the 5000 metadata parquet files and either download or stream the dataset as illustrated in the demo notebooks on the project’s GitHub repository.
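To give a feel for what a 10 km geographical index does, here is a minimal Python sketch. It is not the Major TOM grid definition: the function name `grid_cell`, the spherical Earth radius, and the rule of shrinking the number of columns with latitude are all assumptions made for illustration. The project's repository has the actual grid.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth with mean radius
CELL_KM = 10.0            # nominal cell size mentioned in the text


def grid_cell(lat: float, lon: float, cell_km: float = CELL_KM) -> tuple[int, int]:
    """Return the (row, col) of the roughly cell_km-sized cell containing a point.

    Hypothetical indexing scheme, for illustration only.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("latitude or longitude out of range")

    # Rows: equal steps in latitude, counted from the south pole.
    n_rows = math.ceil(math.pi * EARTH_RADIUS_KM / cell_km)
    row = min(int((lat + 90.0) / 180.0 * n_rows), n_rows - 1)

    # Columns: fewer cells towards the poles, so cells stay about cell_km wide.
    row_center_lat = -90.0 + (row + 0.5) * 180.0 / n_rows
    circumference_km = (
        2.0 * math.pi * EARTH_RADIUS_KM * math.cos(math.radians(row_center_lat))
    )
    n_cols = max(1, math.ceil(circumference_km / cell_km))
    col = min(int((lon + 180.0) / 360.0 * n_cols), n_cols - 1)
    return row, col


# Example: index a point near Enschede (coordinates are approximate).
row, col = grid_cell(52.22, 6.89)
print(f"cell id: {row}_{col}")
```

Because every dataset that adopts the same grid assigns the same cell to the same location, samples from different sensors can be joined on the cell identifier instead of being matched geometrically.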
Croissant: A High-Level Format for ML Datasets
Image credits: Andrea Viliotti, 2024
The challenge of handling a wide variety of data representations is faced not only by the EO researchers at the ESA Φ-lab, but also by many research groups working on Machine Learning in other domains. To address this challenge and to promote interoperability and reproducibility, the MLCommons community working group, together with multiple other contributors, recently released Croissant, a high-level format for discovering and working seamlessly with machine learning datasets. It combines metadata, resource file descriptions, data structure descriptions, and ML semantics such as the definition of train-test sets. Built on the earlier schema.org and DCAT standards, the format is designed to be extensible and relevant for multiple domains and platforms. The Croissant Responsible AI vocabulary includes concepts of bias, fairness, and the use of human labelling. Croissant-defined datasets are supported by popular ML frameworks and can be created or edited using either the mlcroissant Python library or a visual editor.
As the gap between the larger ML community and EO-based ML research narrows, standardizing vast volumes of EO data becomes imperative. Members of NASA's Interagency Implementation and Advanced Concepts Team (IMPACT) are working on a Geo-Croissant extension to include EO-specific characteristics such as spatial reference information, geographical biases, and sampling strategy.
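The layers that Croissant combines can be pictured with a pared-down, Croissant-style JSON-LD document built with Python's standard library. This is only a sketch: the dataset name, URL, and column names are hypothetical, the key names follow our reading of the Croissant 1.0 vocabulary, and the result has not been validated against the specification (the mlcroissant library is the tool for that).

```python
import json


def csv_field(name: str, data_type: str, file_id: str) -> dict:
    """Describe one column of a CSV file as a Croissant-style field (illustrative)."""
    return {
        "@type": "cr:Field",
        "name": name,
        "dataType": data_type,
        "source": {"fileObject": {"@id": file_id}, "extract": {"column": name}},
    }


metadata = {
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "sc:Dataset",
    "dct:conformsTo": "http://mlcommons.org/croissant/1.0",
    # Layer 1: dataset-level metadata (hypothetical values).
    "name": "toy_land_cover",
    "description": "Hypothetical land cover samples for illustration.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # Layer 2: resource file descriptions.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "samples.csv",
            "contentUrl": "https://example.org/samples.csv",  # placeholder URL
            "encodingFormat": "text/csv",
        }
    ],
    # Layer 3: data structure. Layer 4 (ML semantics) is only hinted at by a
    # plain "split" column; the specification has dedicated split semantics
    # that are not reproduced here.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "name": "samples",
            "field": [
                csv_field("pixel_id", "sc:Integer", "samples.csv"),
                csv_field("label", "sc:Text", "samples.csv"),
                csv_field("split", "sc:Text", "samples.csv"),
            ],
        }
    ],
}


def summarize(doc: dict) -> dict:
    """Count the files, record sets, and fields described by the document."""
    record_sets = doc.get("recordSet", [])
    return {
        "files": len(doc.get("distribution", [])),
        "record_sets": len(record_sets),
        "fields": sum(len(rs.get("field", [])) for rs in record_sets),
    }


summary = summarize(metadata)
print(json.dumps(summary))
```

Keeping the description in plain JSON-LD is what allows search engines, dataset hubs, and ML loaders to read the same file without custom parsers.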
150x Speed Boost to pandas with RAPIDS cuDF!
Image credits: NVIDIA, 2024
NVIDIA has dramatically improved the performance of its data processing library RAPIDS cuDF, which now accelerates pandas workloads by up to 150 times without requiring any code changes. Announced at GTC 2024, this enhancement lets data scientists and analysts use GPU acceleration while keeping the familiar pandas workflow on both CPUs and GPUs. Operations run on the GPU where they are supported and automatically fall back to the CPU otherwise, so the same code works on either processing unit.
For further details, consider reading the full article on RAPIDS cuDF on NVIDIA's official blog.
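As a rough sketch of what "no code changes" looks like in practice, the script below enables the cuDF pandas accelerator when it is available and otherwise runs on plain pandas. The `try`/`except` guard and the tiny DataFrame are our own illustrative additions. In a notebook the accelerator is typically enabled with `%load_ext cudf.pandas`, and for scripts with `python -m cudf.pandas script.py`.

```python
# Enable the accelerator before pandas is imported; skip it if cuDF or a
# usable GPU is not available (this guard is an illustrative assumption).
try:
    import cudf.pandas

    cudf.pandas.install()
    accelerated = True
except Exception:
    accelerated = False

import pandas as pd

# Hypothetical data: cloud cover percentages for scenes from two sensors.
df = pd.DataFrame(
    {
        "sensor": ["S2", "S2", "L8", "L8"],
        "cloud_cover": [10.0, 30.0, 5.0, 15.0],
    }
)

# Ordinary pandas code; with the accelerator enabled it runs on the GPU
# where supported.
mean_cloud = df.groupby("sensor")["cloud_cover"].mean()
print("GPU acceleration enabled:", accelerated)
print(mean_cloud)
```

The analysis code itself is identical in both cases, which is the point of the accelerator: only the environment decides where the work runs.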
Upcoming Meetings
- eScience Center Training: Introduction to Geospatial Raster and Vector Data with Python
13 May 2024, Online
- CRIB Training: Publishing Research Data with fairly Toolset
15 May 2024, ITC, Enschede
- SURF Training: Basic Parallel Programming with MPI and OpenMP
21-22 May 2024, Online
- SURF Training: MPI and OpenMP in Scientific Software Development
27-29 May 2024, Online
- eScience Center Training: Collaborative Version Control with git and GitHub
28 May 2024, Amsterdam
- Big Geodata Talk: High-performance Spatial Data Management and Analysis with DuckDB
31 May 2024, ITC, Enschede
- Geospatial Computing Platform Users Meeting
12 June 2024, ITC, Enschede
Recent Releases
- cuGraph: GPU accelerated graph algorithms
24.04.00 (10/04/2024)
- rasterio: Read and write geospatial raster datasets
1.3.10 (12/04/2024)
- CuPy: NumPy-compatible matrix library accelerated by CUDA
13.1.0 (19/04/2024)
- PyTorch: Machine learning library based on the Torch library
2.3.0 (24/04/2024)
The "Big" Picture
Image credits: Li et al., 2024
The Earth Video Cube (EVC) is a spatiotemporal data cube that integrates diverse data sources, including satellite and UAV imagery, to enable comprehensive cross-source analysis. The system's capacity for rapid data processing and its potential for future AI integration make it a valuable tool for real-time decision-making and environmental monitoring. The design of the EVC and its implementation of Analysis Ready Video Data simplify the complexities involved in processing vast video datasets. By providing a structured framework for categorizing and analyzing video content at the frame, object, trajectory, and event levels, EVC allows raw video data to be transformed efficiently into actionable insights.
Li, Z., Cao, Z., Yue, P., and Zhang, C. (2024) Earth Video Cube: A Geospatial Data Cube for Multisource Earth Observation Video Management and Analysis, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17, 4986-5000, doi:10.1109/JSTARS.2024.3358342