Waleed Alzuhair, flickr

Big Geodata Newsletter, November 2024


Greetings from the Big Geodata Newsletter!

In this issue you will find information on Icechunk for transactional cloud-native data storage, Marimo for reactive notebooks whose cells update automatically, Coiled’s call for community input on a geospatial benchmark suite, STAC GeoParquet for efficient cloud-native handling of geospatial metadata, and Fields of the World, a multinational dataset for agricultural boundary segmentation!

Happy reading! 

You can access the previous issues of the newsletter on our web portal. If you find the newsletter useful, please share the subscription link below with your network.

Icechunk: Transactional Cloud Storage for Tensor Data

Image credits: Earthmover.io

Icechunk is a new open-source storage engine designed for the cloud-native management of tensor and multidimensional (ND-array) data. Tailored for scientific and machine learning applications, Icechunk integrates with Zarr, a format renowned for its handling of large structured datasets, and enhances it by adding transactional capabilities, ensuring isolated and safe read-write operations required for multi-user environments in cloud computing. One of Icechunk’s standout features is its "time travel" functionality, which allows users to access previous dataset snapshots, making it easier to revert to past states and monitor changes over time. This version-control feature, coupled with optimized chunk management, empowers users to conduct high-performance, reproducible data analyses without compromising storage or access efficiency. Icechunk's foundation in Rust enables high performance, while its Python integration facilitates usability by scientists and data analysts. 
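To make the snapshot idea concrete, here is a minimal toy sketch in plain Python. It is not Icechunk's API: the class `ToyVersionedStore`, its methods, and the chunk key `temperature/0.0` are all hypothetical names invented for illustration. It only shows the general pattern of staging writes, publishing them together in a commit, and reading from an older snapshot afterwards.

```python
# Toy illustration only: NOT Icechunk's API. All names here are hypothetical.
from typing import Dict, List


class ToyVersionedStore:
    """Keeps one immutable snapshot of all chunks per commit."""

    def __init__(self) -> None:
        self._snapshots: List[Dict[str, bytes]] = [{}]  # snapshot 0 is empty
        self._staged: Dict[str, bytes] = {}

    def write(self, key: str, data: bytes) -> None:
        # Writes are staged and stay invisible to readers until commit().
        self._staged[key] = data

    def commit(self) -> int:
        # Publish all staged writes together as a new snapshot.
        new_snapshot = {**self._snapshots[-1], **self._staged}
        self._snapshots.append(new_snapshot)
        self._staged = {}
        return len(self._snapshots) - 1  # identifier of the new snapshot

    def read(self, key: str, snapshot: int = -1) -> bytes:
        # Default: latest committed snapshot; pass an older id to "time travel".
        return self._snapshots[snapshot][key]


store = ToyVersionedStore()
store.write("temperature/0.0", b"v1")
first = store.commit()
store.write("temperature/0.0", b"v2")
second = store.commit()

latest = store.read("temperature/0.0")
earlier = store.read("temperature/0.0", snapshot=first)
```

Here `latest` holds the second version of the chunk while `earlier` still returns the first one. A real engine would store only the chunks that changed rather than copying the whole mapping, and would coordinate commits between concurrent writers.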

You can dive deeper into Icechunk by exploring it on GitHub or visiting the official website!

Marimo: Reactive Interactive Notebooks in Python

Image credits: marimo.io

Marimo is an innovative open-source notebook system designed for reactive programming in Python. Unlike traditional notebooks, Marimo stands out with its reactive cells, meaning cells update automatically whenever the data they depend on changes. This reactivity allows users to experiment dynamically, making it a powerful tool for tasks that require iterative testing or real-time visualization. With Marimo, developers and data scientists can streamline workflows as changes cascade through cells automatically, reducing redundant recalculations and enhancing productivity. It also prevents out-of-sync cells that are common in Jupyter notebooks. Marimo is designed to integrate seamlessly with Python’s scientific libraries, making it ideal for data analysis, machine learning, and scientific computing. Its intuitive setup and support for complex dependency tracking make Marimo both flexible and user-friendly. 
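The reactive behaviour described above can be pictured with a small toy model in plain Python. This is a sketch of the general idea, not how Marimo is implemented: `ToyReactiveNotebook` and its methods are hypothetical, each "cell" is reduced to a function with named inputs and one named output, and the dependency graph is assumed to be free of cycles.

```python
# Toy illustration of reactive execution; NOT Marimo's implementation or API.
import math
from typing import Any, Callable, Dict, List, Tuple


class ToyReactiveNotebook:
    def __init__(self) -> None:
        self.values: Dict[str, Any] = {}
        # Maps an output name to (function, names of the inputs it reads).
        self.cells: Dict[str, Tuple[Callable[..., Any], List[str]]] = {}

    def cell(self, output: str, inputs: List[str], func: Callable[..., Any]) -> None:
        # Register a cell and run it once if its inputs are already defined.
        self.cells[output] = (func, inputs)
        self._run(output)

    def set(self, name: str, value: Any) -> None:
        # Changing a value re-runs every cell that depends on it.
        self.values[name] = value
        self._rerun_dependents(name)

    def _run(self, output: str) -> None:
        func, inputs = self.cells[output]
        if all(name in self.values for name in inputs):
            self.values[output] = func(*(self.values[name] for name in inputs))
            self._rerun_dependents(output)  # changes cascade downstream

    def _rerun_dependents(self, name: str) -> None:
        for output, (_, inputs) in self.cells.items():
            if name in inputs:
                self._run(output)


nb = ToyReactiveNotebook()
nb.set("radius", 1.0)
nb.cell("area", ["radius"], lambda r: math.pi * r * r)
nb.cell("label", ["area"], lambda a: f"area={a:.2f}")

nb.set("radius", 2.0)  # "area" and "label" are recomputed automatically
```

After the last line, `nb.values["label"]` reflects the new radius without any cell being re-run by hand, which is the property that keeps cells from drifting out of sync.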

Discover more about Marimo on GitHub or explore its features and applications on the official website. Experience a more dynamic way of working with interactive notebooks!

Geospatial Benchmark Suite: Call for Contributions

Image credits: Coiled.io

Coiled, renowned for its Dask-based cloud computing infrastructure, is building a geospatial benchmark suite to test and evaluate distributed computing performance across diverse geospatial workloads. This open project aims to identify and solve computational challenges specific to geospatial data processing. To create a comprehensive and versatile suite, Coiled invites contributors to propose geospatial workloads that reflect real-world scenarios and data processing needs. If you’re working with geospatial data in fields such as climate science, urban planning, or remote sensing, this is a good opportunity to shape a tool useful for the geospatial data science community.

You can join the initiative and share your suggestions on the GitHub discussion. More information about the suite is available here.

STAC GeoParquet: Optimizing Geospatial Data for the Cloud


Image credits: INPE, Brazil

STAC GeoParquet combines the SpatioTemporal Asset Catalog (STAC) standard with Apache Parquet to optimize the storage, access, and analysis of large-scale geospatial data. STAC is widely used to catalog and organize spatial assets, making them easier to discover and retrieve. Apache Parquet, a columnar storage format designed for efficient data analytics, complements this by offering highly optimized data access capabilities, particularly suited for cloud-based data storage and processing. STAC GeoParquet stores metadata of geospatial datasets in the GeoParquet format. This not only allows for cloud-native access but also enables seamless integration with popular data analysis tools like Dask, Pandas, and Apache Spark to query and manipulate data more effectively across distributed platforms.
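A small sketch may help show why a columnar layout suits catalog queries. The example below uses only the Python standard library and two made-up STAC-like items (the ids, bounding boxes, and values are invented for illustration). It flattens each item into one row, promotes the entries under `properties` to their own columns, and then filters on a single column. This mirrors the general idea of one row per item, under the assumption that properties become top-level columns; it does not read or write actual Parquet files, and `to_columns` is a hypothetical helper.

```python
# Toy illustration: STAC-like items flattened into columns. No Parquet I/O here.
from typing import Any, Dict, List

# Two invented items; "eo:cloud_cover" is a field from the STAC EO extension.
items: List[Dict[str, Any]] = [
    {
        "id": "scene-a",
        "bbox": [5.0, 52.0, 6.0, 53.0],
        "properties": {"datetime": "2024-06-01T10:00:00Z", "eo:cloud_cover": 12.0},
    },
    {
        "id": "scene-b",
        "bbox": [6.0, 52.0, 7.0, 53.0],
        "properties": {"datetime": "2024-06-03T10:00:00Z", "eo:cloud_cover": 64.0},
    },
]


def to_columns(stac_items: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of items (rows) into a dict of equally long columns."""
    property_names = sorted({k for item in stac_items for k in item["properties"]})
    columns: Dict[str, List[Any]] = {
        "id": [item["id"] for item in stac_items],
        "bbox": [item["bbox"] for item in stac_items],
    }
    for name in property_names:
        # Items that lack a property get None in that column.
        columns[name] = [item["properties"].get(name) for item in stac_items]
    return columns


columns = to_columns(items)

# A query touches only the columns it needs, here "id" and "eo:cloud_cover".
clear_ids = [
    item_id
    for item_id, cloud in zip(columns["id"], columns["eo:cloud_cover"])
    if cloud is not None and cloud < 20.0
]
```

With the metadata in this shape, a query engine can scan just the cloud-cover column to find `scene-a` instead of parsing every item in full, which is the access pattern that columnar formats such as Parquet are designed for.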

Learn more about this project on GitHub and explore its integration with STAC and Apache Parquet.

Upcoming Meetings

Recent Releases

The "Big" Picture


Image credits: Kerner et al. (2024)

The Fields of the World (FTW) dataset was developed to address the need for ML-ready datasets for automatic extraction of agricultural field boundaries from remotely sensed imagery. By spanning diverse agricultural landscapes globally, FTW enables country-specific evaluations for a larger set of regions than any previous dataset. Existing field segmentation models have shown good performance on datasets like PASTIS and AI4Boundaries, but the limited geographic scope of those datasets restricts application to broader landscapes. FTW, with its 70,462 labeled samples covering more than 166,000 km², surpasses previous datasets in size, diversity, and geographic representation, enabling consistent, granular evaluations and promoting transfer learning and zero-shot generalization across different regions. By providing comprehensive metadata for each sample, FTW enables flexible future applications and expansions as more global field data becomes available.

Explore FTW’s full potential on GitHub or access the dataset via Source Cooperative.

Kerner, H., Chaudhari, S., Ghosh, A., Robinson, C., Ahmad, A., Choi, E., Jacobs, N., Holmes, C., Mohr, M., Dodhia, R., Ferres, J. M. L., and Marcus, J. (2024) Fields of the World: A Machine Learning benchmark dataset for global agricultural field boundary segmentation. arXiv (Cornell University). doi:10.48550/arxiv.2409.16252 

CRIB News
Tools for Sustainable Research Software

One of the research topics we concentrate on at CRIB is sustainable research software. Together with the Netherlands eScience Center, we organized a mini symposium on the topic in 2022, with a special focus on geospatial software. We are currently working with a large number of partners in the TDCC NES project on "Best Practices for Sustainable Software". The work package we lead aims to develop tools that enable and facilitate these best practices.

In October, during the National Open Science Festival (OSF 2024), we made the first public announcement of two such tools: MetaTemplate - a modern meta-template (i.e. a template of templates) for creating research software templates, and CodeScanner - a Python package and command-line tool to check code quality and conformity with research software development best practices. Both tools are currently in alpha, so further development is required to make them fully functional. But the first feedback we got was quite positive, as you can see in the post-it picture!

If you want to have a look at the ideas we have regarding research software templates and conformity checking tools, you can check our OSF 2024 presentation slides available on Zenodo.