Greetings from the Big Geodata Newsletter!
In this issue you will find information on HyperCoast, a new Python package for hyperspectral data; Coiled’s benchmarking of DataFrame technologies; news on the retirement of Microsoft’s Planetary Computer Hub; and on SIRCLE and SWAG, two new models that tackle the challenges of processing petabyte-scale EO data.
Dr. Jon Wang from the Department of Urban and Regional Planning and Geo-Information Management (PGM) at ITC shares his experience of using our Geospatial Computing Platform in "Understanding Cities with their Physical Forms at Large Scale". Don't miss the Big Geodata Story!
Happy reading!
You can access the previous issues of the newsletter on our web portal. If you find the newsletter useful, please share the subscription link below with your network.
HyperCoast: Interactive Hyperspectral Data Visualization!
Image credits: HyperCoast, 2024
HyperCoast is a new Python package for interactively visualizing and analysing hyperspectral data. The package lets users search and download data from well-known NASA hyperspectral datasets such as AVIRIS, DESIS, PACE, and ECOSTRESS. Because each hyperspectral sensor uses a different file format, the developers built dedicated visualization tools for each kind of dataset. Users can change band combinations and colormaps, and interactively extract and visualize spectral signatures in the JupyterLab interface. The selected spectral signatures can then be saved as a CSV file. Built on the 3D visualization package PyVista, HyperCoast can also visualize hyperspectral data as a sliceable 3D cube with interactive analysis capabilities!
Primarily developed for coastal areas, this module by Dr. Qiusheng Wu and team can be extended for use in any domain that works with hyperspectral data. Check out a demo video on the 3D visualization capabilities, and follow the tutorials for using this package.
Microsoft Retires Planetary Computer Hub
Image credits: Planetary Computer, 2024
Microsoft announced the retirement of the Planetary Computer Hub on June 6, 2024. The decision was attributed to a shift in Microsoft’s strategic focus towards tightening security requirements across all Microsoft systems. The Hub helped researchers and organizations access and analyze large-scale environmental data effectively. The change affects only the Planetary Computer Hub; the Planetary Computer Data and APIs remain unaffected.
For more insight into this decision and instructions for retrieving your Hub home directory, refer to the official GitHub discussion.
Benchmarking DataFrames at Scale: Coiled's TPC-H Analysis
Image credits: Coiled, 2024
Coiled recently conducted a comprehensive benchmark analysis using the TPC-H suite to evaluate the performance of various DataFrame technologies—Spark, Dask, DuckDB, and Polars—across different scales and hardware setups. The study revealed that while no single project consistently outperformed others, DuckDB and Dask showed robust performance across many scenarios. Spark, though widely used, lagged in efficiency and ease of use. Polars excelled in small-scale local tasks but struggled with larger, cloud-based workloads. The findings provide valuable insights for selecting the right tool based on specific use cases and data scales.
For a detailed breakdown of the benchmarks and specific performance metrics, read the full article here.
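For readers who want to compare tools on their own workloads, the sketch below shows one generic way to time competing implementations of the same query. It is not Coiled's benchmark harness and does not run TPC-H. The `benchmark` helper and the two stand-in group-by functions are hypothetical names introduced for illustration, and they use only the Python standard library. In practice each callable would wrap the same query written for Dask, DuckDB, Polars, or Spark.

```python
import statistics
import time
from itertools import groupby


def benchmark(queries, repeats=3):
    """Run each callable `repeats` times; return median wall-clock seconds per name."""
    results = {}
    for name, run_query in queries.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query()
            timings.append(time.perf_counter() - start)
        results[name] = statistics.median(timings)
    return results


# Hypothetical stand-in data: (key, value) rows for a group-by sum.
rows = [(i % 7, i * 0.5) for i in range(100_000)]


def group_sum_dict():
    """Group-by sum using a dictionary accumulator."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0.0) + value
    return totals


def group_sum_sorted():
    """Group-by sum using sort followed by itertools.groupby."""
    ordered = sorted(rows, key=lambda row: row[0])
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=lambda row: row[0])
    }


timings = benchmark({"dict": group_sum_dict, "sorted": group_sum_sorted})
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Taking the median over several runs reduces the influence of one-off slowdowns. As the Coiled analysis suggests, results are only meaningful for the data scale and hardware on which they were measured.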
Upcoming Meetings
- Conference: FOSS4G, Europe 2024
  1-7 July 2024, Tartu, Estonia
- SURF Training: High-Performance Deep Learning
  8-11 July 2024, Amsterdam
- SURF Training: Introduction to SURF Research Cloud
  25 July 2024, Amsterdam
- CRIB Training: Introduction to Geospatial Raster and Vector with R
  21-22 August 2024, ITC, Enschede
Recent Releases
- Dask: Flexible library for parallel computing
  2024.6.0 (14/6/2024)
- cuGraph: GPU accelerated graph algorithms
  24.06.1 (13/6/2024)
- libspatialindex: C/C++ library for spatial indexes
  2.0.0 (7/6/2024)
- OpenCV: Open source computer vision library
  4.10.0 (3/6/2024)
The "Big" Picture
Image credits: Consoli et al., 2024
Researchers have unveiled SIRCLE (Signal Imputation and Refinement with Convolution Leaded Engine) and SWAG (Seasonally Weighted Average Generalization), two new tools designed to tackle the challenges of processing petabyte-scale Earth Observation (EO) time-series data from missions like NASA's Landsat and ESA's Sentinel. Current solutions offer limited flexibility when handling anomalies like cloud cover. SIRCLE introduces flexibility in time-series processing through adjustable convolution kernels. SWAG, integrated within SIRCLE, then leverages seasonality in EO data to reconstruct missing values, prioritizing recent images for enhanced accuracy. Benchmark tests reveal that SWAG reduces reconstruction errors by at least 15% compared to other methods. In a large-scale application, SIRCLE and SWAG processed the entire Global Land Analysis and Discovery (GLAD) ARD-2 Landsat archive. This effort produced a cloud-free bi-monthly product spanning 1997 to 2022, involving over two trillion pixels. Processing took approximately 28 hours on 1248 Intel Xeon CPUs. The processed data, stored as Cloud-Optimized GeoTIFFs (COG), are now open-access, enabling efficient and affordable environmental monitoring and analysis.
Davide Consoli, Leandro Parente, Rolf Simoes et al. A computational framework for processing time-series of Earth Observation data based on discrete convolution: global-scale historical Landsat cloud-free aggregates at 30 m spatial resolution, (23 May, 2024), PREPRINT (Version 1) available at Research Square, https://doi.org/10.21203/rs.3.rs-4465582/v1
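The seasonal gap-filling idea described above can be illustrated with a small sketch. This is not the authors' SWAG implementation, which is built on discrete convolution inside SIRCLE. It is a simplified loop-based illustration that fills each missing value with a weighted average of observations from the same season in other years, giving higher weight to years closer in time. The function name `seasonal_weighted_fill`, the `decay` parameter, and the exponential weighting scheme are assumptions introduced here for illustration.

```python
import math


def seasonal_weighted_fill(series, periods_per_year, decay=0.5):
    """Fill NaN gaps with a weighted average of same-season values from other years.

    Illustrative assumption: weights shrink by a factor of `decay` for every
    year of distance between the gap and the observation used to fill it.
    """
    filled = list(series)
    n = len(series)
    for i, value in enumerate(series):
        if not math.isnan(value):
            continue  # observed value, nothing to reconstruct
        weighted_sum = 0.0
        weight_total = 0.0
        # Visit every time step at the same position within the year.
        for j in range(i % periods_per_year, n, periods_per_year):
            if math.isnan(series[j]):
                continue  # skip other gaps (including the current one)
            years_apart = abs(j - i) // periods_per_year
            weight = decay ** years_apart
            weighted_sum += weight * series[j]
            weight_total += weight
        if weight_total > 0:
            filled[i] = weighted_sum / weight_total
    return filled


# Three years of bi-monthly values (6 per year) with one cloudy gap in year 2.
nan = float("nan")
series = [
    1.0, 2.0, 3.0, 4.0, 5.0, 6.0,
    1.0, nan, 3.0, 4.0, 5.0, 6.0,
    1.0, 4.0, 3.0, 4.0, 5.0, 6.0,
]
filled = seasonal_weighted_fill(series, periods_per_year=6)
print(filled[7])  # 3.0: average of 2.0 (year 1) and 4.0 (year 3), equally distant
```

A gap with no valid same-season observation in any year stays NaN in this sketch. A production pipeline would vectorize the computation over whole image tiles rather than loop over single pixels.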