Greetings from the Big Geodata Newsletter!
In this issue, explore GeoJupyter for interactive geospatial computing within Jupyter, discover Amadeus for simplifying large-scale environmental data analysis in R, and learn about NCZarr, which connects netCDF and Zarr for scalable, cloud-native scientific data management. You’ll also find insights into GeoCore, a powerful and scalable framework for geospatial machine learning, and EarthView, a large-scale remote sensing dataset designed to advance self-supervised learning in Earth observation applications.
Happy reading!
You can access the previous issues of the newsletter on our web portal. If you find the newsletter useful, please share the subscription link below with your network.
GeoJupyter: Enhancing interactive geospatial computing
Image credits: JupyterGIS, 2025
GeoJupyter is an open-source initiative designed to bridge the gap between traditional GIS tools and code-driven geospatial workflows within Jupyter. It integrates interactive mapping, reproducible workflows, and collaborative storytelling, making geospatial analysis more accessible for education, research, and industry applications. GeoJupyter enables users to explore and analyze spatial data through interactive maps, combining the flexibility of Jupyter Notebooks with the visualization capabilities of desktop GIS tools. It supports popular Python geospatial libraries such as GeoPandas, Rasterio, and Xarray, allowing users to conduct advanced spatial computations while ensuring workflow reproducibility. The platform also facilitates geospatial storytelling by integrating data visualization with narrative elements, making it useful for presenting spatial insights in a structured and engaging manner. The project is community-driven, encouraging contributions that improve existing tools and expand functionalities.
Learn more about GeoJupyter and its applications in geospatial computing here. You can also access its source code or try its demo.
Amadeus: Streamlining large-scale environmental data analysis in R
Image credits: Manware et al., 2025
Amadeus is an R package designed to simplify access to large-scale environmental datasets, with applications in environmental health, ecology, and climatology. It provides functions for downloading, processing, and calculating covariates from publicly available data sources, primarily from NASA, NOAA, USGS, and EPA. The package integrates with popular spatial R packages, and its test-driven development approach is intended to support reproducibility. Its structured workflow minimizes the learning curve, making geospatial analysis more accessible to researchers from different domains. Use cases in environmental health research include exposure assessments, epidemiological modeling, and air pollution analysis. Future developments for Amadeus include expanding its dataset coverage beyond the U.S., incorporating gap-filling techniques for missing data, and improving its adaptability to different statistical workflows.
Discover the full capabilities of Amadeus for environmental data analysis here. Explore the Amadeus repository and documentation here.
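To make "calculating covariates" more concrete, here is a minimal Python sketch of one common pattern: averaging observations that fall within a circular buffer around a study site. It is a conceptual illustration only. Amadeus itself is an R package, and the function names, the haversine distance, and the sample coordinates below are assumptions made for this example, not part of the package.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def buffer_mean(site, observations, radius_km):
    """Mean of observation values within radius_km of site.

    site is (lat, lon); observations is an iterable of (lat, lon, value).
    Returns None when no observation falls inside the buffer.
    """
    values = [
        value
        for lat, lon, value in observations
        if haversine_km(site[0], site[1], lat, lon) <= radius_km
    ]
    return sum(values) / len(values) if values else None
```

For example, `buffer_mean((40.0, -105.0), readings, 10)` would return the average of all hypothetical `readings` within 10 km of the site, which is the kind of per-location summary an exposure assessment might use as a covariate.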
Bridging netCDF and Zarr for scalable cloud-native scientific data management
The National Science Foundation's Unidata program has developed NCZarr, an extension to the netCDF framework that integrates the Zarr data and storage models. This initiative aims to enhance interoperability and accessibility of scientific data in cloud-native environments. NCZarr enables the storage and manipulation of netCDF-4 data models within Zarr-compatible storage systems, facilitating seamless access to data stored in cloud platforms like Amazon S3. This integration allows researchers to leverage existing netCDF tools and workflows while benefiting from the scalability and flexibility of cloud-based storage solutions. The NCZarr data model supports essential netCDF-4 features, including shared dimensions, attributes, chunking, fill values, groups, and N-dimensional variables. However, certain netCDF-4 features, such as user-defined types and the String type, are currently unsupported. Despite these limitations, NCZarr maintains compatibility with uncompressed Zarr datasets, ensuring that datasets conforming to the Zarr version 2 specification are readable by NCZarr and vice versa.
For more detailed information on NCZarr, including its data model, enabling support, and accessing data using the NCZarr protocol, refer to the NetCDF User's Guide.
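To illustrate what an uncompressed Zarr version 2 array looks like on disk, the following sketch writes and reads one using only the Python standard library. It follows the layout described in the Zarr v2 specification (a `.zarray` JSON document plus one file per chunk), restricted to 2-D little-endian float32 data. The function names are invented for this example, and the additional metadata that NCZarr itself writes is not reproduced here.

```python
import json
import struct
from pathlib import Path


def write_zarr_v2(root, name, rows, chunks, fill_value=0.0):
    """Write a 2-D float32 array as an uncompressed Zarr v2 array."""
    root = Path(root)
    array_dir = root / name
    array_dir.mkdir(parents=True, exist_ok=True)
    # Marks the root directory as a Zarr v2 group.
    (root / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
    n_rows, n_cols = len(rows), len(rows[0])
    c_rows, c_cols = chunks
    meta = {
        "zarr_format": 2,
        "shape": [n_rows, n_cols],
        "chunks": [c_rows, c_cols],
        "dtype": "<f4",       # little-endian 32-bit float
        "compressor": None,   # uncompressed chunks
        "filters": None,
        "fill_value": fill_value,
        "order": "C",         # row-major layout inside each chunk
    }
    (array_dir / ".zarray").write_text(json.dumps(meta))
    for ci in range(-(-n_rows // c_rows)):          # ceiling division
        for cj in range(-(-n_cols // c_cols)):
            values = []
            for i in range(ci * c_rows, (ci + 1) * c_rows):
                for j in range(cj * c_cols, (cj + 1) * c_cols):
                    inside = i < n_rows and j < n_cols
                    # Edge chunks are stored full-size, padded with fill_value.
                    values.append(rows[i][j] if inside else fill_value)
            payload = struct.pack(f"<{len(values)}f", *values)
            (array_dir / f"{ci}.{cj}").write_bytes(payload)
    return array_dir


def read_zarr_v2(array_dir):
    """Read back an array written by write_zarr_v2 as a list of rows."""
    array_dir = Path(array_dir)
    meta = json.loads((array_dir / ".zarray").read_text())
    n_rows, n_cols = meta["shape"]
    c_rows, c_cols = meta["chunks"]
    out = [[meta["fill_value"]] * n_cols for _ in range(n_rows)]
    for ci in range(-(-n_rows // c_rows)):
        for cj in range(-(-n_cols // c_cols)):
            chunk_file = array_dir / f"{ci}.{cj}"
            if not chunk_file.exists():
                continue  # a missing chunk reads as fill_value
            values = struct.unpack(f"<{c_rows * c_cols}f", chunk_file.read_bytes())
            for k, value in enumerate(values):
                i, j = ci * c_rows + k // c_cols, cj * c_cols + k % c_cols
                if i < n_rows and j < n_cols:
                    out[i][j] = value
    return out
```

Because each chunk is an independent file (or object, in a store such as Amazon S3), a reader can fetch only the chunks that overlap the region it needs, which is the property that makes this layout suit cloud storage.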
GeoCore: An efficient and scalable framework to optimize geospatial machine learning
Image credits: GeoCore, 2025
GeoCore is an open-source Python library developed by Zanskar Geothermal & Minerals to improve geospatial machine learning workflows. Designed to support large-scale data analysis, it integrates H3 grid indexing, automated feature caching, and spatial cross-validation to optimize data processing. The framework enables seamless interaction with spatial databases such as PostgreSQL, Snowflake, and BigQuery, ensuring efficient data management and analysis. GeoCore provides a dynamic registry system for managing machine learning models and experiment tracking using MLflow. The INGENIOUS project applied GeoCore to analyze geothermal systems across the Great Basin region, demonstrating its ability to handle high-resolution geospatial data. Future developments will focus on enhanced model reproducibility, advanced spatial cross-validation techniques, and greater scalability for large datasets.
Learn more about GeoCore and its applications in geospatial ML here. Read the full research paper presented at the Stanford Geothermal Workshop here.
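The idea behind spatial cross-validation can be sketched in a few lines of Python. The example below groups points into grid cells and holds out whole cells together, so that nearby samples never appear on both sides of a split. It is not GeoCore's API: the function names are hypothetical, and simple square latitude/longitude cells stand in for the H3 hexagons that GeoCore uses.

```python
import math
from collections import defaultdict


def cell_id(lat, lon, cell_deg=1.0):
    """Index of the square grid cell containing the point (a stand-in for an H3 cell)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def spatial_kfold(points, k=3, cell_deg=1.0):
    """Yield (train_idx, test_idx) pairs in which whole cells are held out together.

    points is a sequence of (lat, lon) tuples.
    """
    by_cell = defaultdict(list)
    for idx, (lat, lon) in enumerate(points):
        by_cell[cell_id(lat, lon, cell_deg)].append(idx)

    # Greedy balancing: place the largest cells first, always into the
    # fold that currently holds the fewest points.
    folds = [[] for _ in range(k)]
    for cell in sorted(by_cell, key=lambda c: (-len(by_cell[c]), c)):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_cell[cell])

    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test
```

Compared with a random split, this grouping tends to give a more conservative estimate of how a model performs in areas it has not seen, because spatially autocorrelated neighbours cannot leak between training and test sets.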
Upcoming Events
- Machine Learning in Python with Scikit-learn
eScience Center, Amsterdam, 11-12 March
- Introduction to Geospatial Raster and Vector Data with Python
ITC, Enschede, 12-13 March
- AI for Good Workshop: Earth Observation Foundation Models with Prithvi EO 2.0 and Terratorch
Online, 19 March
- Introduction to Supercomputing, Part II
SURF, Amsterdam, 24 March
- Introduction to Supercomputing, Part III
SURF, Amsterdam, 25 March
- Cloud Native Geospatial Conference 2025
CNG, Utah, USA, 30 April - 2 May
The "Big" Picture
Image credits: Velazquez et al., 2025
EarthView is a large-scale dataset designed to advance self-supervised learning (SSL) in remote sensing. Spanning 15 terapixels of global remote sensing data, EarthView integrates multispectral, hyperspectral, radar, and topographical data from sources like Sentinel, NEON, and Satellogic. The dataset is stored in Parquet format and hosted on Hugging Face, making it easily accessible for large-scale Earth monitoring tasks. To leverage the dataset’s diverse modalities, the authors developed EarthMAE, a customized masked autoencoder optimized for self-supervised feature learning. EarthMAE processes heterogeneous data and shows improved performance on downstream tasks such as land cover classification and scene understanding. The authors' experiments highlight the benefits of multi-source pretraining, with results showing that using high-resolution Satellogic data enhances model generalization. Future research directions include refining masking strategies for self-supervised training, improving model scalability, and exploring additional data sources such as text-based metadata.
Learn more about EarthView and how it supports self-supervised learning here.
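For readers new to masked autoencoders, the sketch below shows the generic random patch masking step that this family of models relies on: most patches of an image are hidden, and the model learns by reconstructing them. It is a simplified, standard-library illustration, not EarthMAE's actual masking strategy; the function name and the 0.75 default ratio are assumptions for this example.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio=0.75, seed=None):
    """Return a grid_h x grid_w grid of booleans, True where a patch is masked.

    Exactly int(grid_h * grid_w * mask_ratio) patches are masked,
    chosen uniformly at random.
    """
    rng = random.Random(seed)
    n_patches = grid_h * grid_w
    n_masked = int(n_patches * mask_ratio)
    order = list(range(n_patches))
    rng.shuffle(order)
    masked = set(order[:n_masked])
    return [
        [(row * grid_w + col) in masked for col in range(grid_w)]
        for row in range(grid_h)
    ]
```

During pretraining, only the unmasked patches would be passed to the encoder, and the reconstruction loss would be computed on the masked ones. Passing a fixed `seed` makes a mask reproducible, which helps when comparing masking strategies.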
Velazquez, D., López, P. R., Alonso, S., Gonfaus, J. M., Gonzalez, J., Richarte, G., Marin, J., Bengio, Y., & Lacoste, A. (2025). EarthView: A large scale remote sensing dataset for self-supervision. arXiv. https://doi.org/10.48550/arxiv.2501.08111