
High-performance Spatial Data Management and Analysis with DuckDB

Join us for the Big Geodata Talk on DuckDB Spatial!

DuckDB is a novel in-process SQL database system designed for analytical workloads that has been making waves in the data science and engineering community, not only for its impressive performance, but also for its focus on ease of use and its integrations with the wider data ecosystem. A key part in making this possible is DuckDB's flexible extension system, which enables DuckDB to be used across different domains while the core system itself remains small and focused. One such extension is the DuckDB Spatial Extension, which brings geospatial data processing capabilities to DuckDB, allowing users to perform complex spatial queries and transformations. By incorporating the trifecta of foundational open source GIS libraries (GDAL, GEOS and PROJ) as well as natively implemented geospatial algorithms, all neatly packaged into a single binary with no runtime dependencies, the spatial extension provides hundreds of familiar spatial SQL functions and import and export capabilities for dozens of different vector file formats.
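As a minimal sketch of what this looks like in practice (using the duckdb Python package; the file name, the attribute values and the default geometry column name are assumptions for illustration), installing and loading the extension takes two statements, after which spatial SQL and GDAL-backed import become available:

```python
import duckdb

con = duckdb.connect()            # in-process, in-memory database
con.execute("INSTALL spatial;")   # fetch the extension once
con.execute("LOAD spatial;")      # load it into this session

# Import a vector file through the GDAL-backed ST_Read table function
# ('buildings.geojson' is a hypothetical file)
con.execute("""
    CREATE TABLE buildings AS
    SELECT * FROM ST_Read('buildings.geojson');
""")

# Use familiar spatial SQL on the imported data
print(con.sql("""
    SELECT ST_Area(geom) AS area
    FROM buildings
    ORDER BY area DESC
    LIMIT 5;
""").fetchall())
```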

Just like DuckDB tries to default to the behavior of PostgreSQL, the spatial extension is heavily inspired by PostGIS and similarly follows the Simple Features SQL standard. However, while the Simple Features geometry model undoubtedly provides a great deal of flexibility with its hierarchy of subtypes (points, linestrings, multipolygons) and optional Z and M dimensions, it is not always the most efficient representation for modern high-performance processing. The spatial extension therefore implements a number of geospatial algorithms natively to make the most of DuckDB's vectorized execution engine and memory model, and it also complements the GEOMETRY type that we all know and love with a new set of strongly typed spatial types backed by a columnar storage model, similar to what is being proposed in the GeoArrow project. This makes DuckDB's spatial extension an exciting project, as it stands with one foot firmly in the traditional open source GIS world and the other in the modern data science and engineering movement.
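To give a rough feel for the difference, here is a hedged sketch (the exact set of specialized types and the casts available between them and GEOMETRY may vary between versions of the extension) showing the flexible GEOMETRY type next to the column-backed POINT_2D type:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

# A classic Simple Features style GEOMETRY value...
con.sql("SELECT ST_AsText(ST_GeomFromText('POINT (1 2)')) AS wkt").show()

# ...and the same point as a strongly typed, columnar POINT_2D,
# assuming the GEOMETRY -> POINT_2D cast provided by the extension
con.sql("SELECT ST_GeomFromText('POINT (1 2)')::POINT_2D AS pt").show()
```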

In this talk, we will introduce DuckDB and the DuckDB Spatial Extension, walk through some of the internals that make DuckDB special as well as some of the challenges and design decisions encountered when adapting it for geospatial processing. We will also showcase some of the main features the spatial extension brings to the table today and share some insights into the future of the project.

Date

31 May 2024, 11:00-12:00 CET

Venue

ITC Langezijds Building, Room LA 1212
Hallenweg 8, 7522 NH Enschede

or

Online

Registration

Registration is closed. You can participate in the event online at https://itc.nl/live.

Speaker

Max Gabrielsson
Software Engineer, DuckDB Labs

Max Gabrielsson is a software engineer at DuckDB Labs where he works on the DuckDB database system in general and is the primary developer of the spatial extension. Previously he's held a variety of roles at different small-scale startups and co-founded a company selling maps, which sparked his interest in geospatial data management and processing. Max holds a BSc in Computer Science from Uppsala University.

Video

Presentation

Questions and Answers

  • Many people approach a geospatial database with a GIS client for mapping purposes. Is this possible with DuckDB, and how would you set that up?

    Yes, there are actually two things in this space. There is a QGIS plug-in developed by a third party. I haven’t used it myself, but it apparently offloads some heavy computations to DuckDB. There is also a DBeaver plugin for DuckDB that has spatial support, but it doesn’t offer full GIS functionality. In the future, that is something I want to make easier, making DuckDB more spatially interoperable.

  • You spoke about a possible implementation of geometries using structs, but then you end up with "records" again at that point, don't you? Did I misinterpret it?

    DuckDB structs are a little different, because you don’t get the whole record in one row. You basically get separate columns for each field, so it’s kind of like a struct with an array layout (see the small illustration below). Internally it looks very similar to just having another table nested inside the column.
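A small illustration of that idea using DuckDB's ordinary STRUCT type (this is not the spatial extension's internal code, just a sketch of the struct-of-arrays layout being described): each field of a struct column is stored as its own child vector, and fields can be addressed like nested columns:

```python
import duckdb

con = duckdb.connect()

# A column of structs: logically one value per row, physically one
# child vector per field (x and y), i.e. a struct-of-arrays layout
con.sql("""
    WITH pts AS (
        SELECT {'x': 1.0, 'y': 2.0} AS pt
        UNION ALL
        SELECT {'x': 3.0, 'y': 4.0}
    )
    SELECT pt, pt.x, pt.y FROM pts;
""").show()
```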

  • You say you've implemented 100 plus spatial functions. In PostGIS last time I checked, there are 1000 plus or so. Do you expect more complications as you go?

    Yes, that's a good question. I think initially, when I first published it, I kind of wanted to translate everything that was provided. Then it kind of caught up with me, and we had to rethink how the geometry model is represented internally. So there's been a lot of refactoring over the last six months, but I hope we can now get back to adding features.

  • Do you have optimizations specific to geospatial functions in place? How does it work?

    We have two things going on. One is pushing down predicates: especially for read functions, when you read from GDAL and also need to filter on some geometry predicate, such as which zones intersect, you can push that comparison down into GDAL, which for some formats and drivers can be executed more efficiently there.

    The main optimization that we have in place is for spatial joins. Normally, when you do a join like this, you get what’s called a nested-loop join, which basically compares everything with everything, so it has n x m complexity. That’s really not great and is pretty expensive. What we do in the serialized format is cache the bounding box of the geometry at the beginning. There we sacrifice some write speed, since we have to compute the bounding box on every update and insert, but in DuckDB’s case that makes sense, because in an analytical database you are probably going to read much more than you write. Then, having the bounding box, when you do a join, at least an inner join, on a spatial predicate, we rewrite it into a range join on the bounding boxes first and move the exact predicate out as an extra filter on top (sketched below, after this answer). This sounds complicated, but DuckDB has a really efficient range-join operator.

    We haven’t implemented a proper spatial join yet, but it is something that I really look forward to getting started with. DuckDB itself has a really good hash join, but yes, the spatial counterpart doesn’t match up as well yet.
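A hedged, hand-written sketch of the rewrite described above (the real optimization happens inside DuckDB's optimizer using the bounding box cached in the serialized geometry; here the bounding boxes are recomputed with ST_XMin/ST_XMax/ST_YMin/ST_YMax, and the tables are made up for illustration):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

# Two toy tables of polygons
con.execute("""
    CREATE TABLE a AS SELECT 1 AS id,
        ST_GeomFromText('POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0))') AS geom;
""")
con.execute("""
    CREATE TABLE b AS SELECT 1 AS id,
        ST_GeomFromText('POLYGON ((1 1, 3 1, 3 3, 1 3, 1 1))') AS geom;
""")

# Naive spatial join: a nested-loop join, evaluating the predicate n x m times
con.sql("""
    SELECT a.id AS a_id, b.id AS b_id
    FROM a JOIN b ON ST_Intersects(a.geom, b.geom);
""").show()

# Conceptually what the rewrite produces: a range join on the bounding
# boxes first, with the exact predicate kept as a filter on the survivors
con.sql("""
    SELECT a.id AS a_id, b.id AS b_id
    FROM a JOIN b
      ON  ST_XMin(a.geom) <= ST_XMax(b.geom)
      AND ST_XMax(a.geom) >= ST_XMin(b.geom)
      AND ST_YMin(a.geom) <= ST_YMax(b.geom)
      AND ST_YMax(a.geom) >= ST_YMin(b.geom)
    WHERE ST_Intersects(a.geom, b.geom);
""").show()
```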

  • Introducing new datatypes makes sense, but do you think it can be done in a way that is hidden from the user? Because the user doesn't need to know these implementation details.

    Yes, in general it is pretty transparent. These new geometry types are always implicitly castable to the corresponding GEOMETRY type. So the idea is that if you want to use functions that are implemented for the specialized types, you get those optimized functions; otherwise an implicit cast is performed and the other available functions are used. This cast is not very expensive.

    I tried to make it a little friendlier for the user, but I do think at some point you need some way to differentiate between the implementations, and I think doing that at the type level is probably the best way to do it.

  • If you are going to host data in the cloud and have multiple users accessing it at the same time, how is that possible?

    DuckDB can’t do writes remotely, but it can do reads remotely, even for the DuckDB file format itself. A really common way I see DuckDB being used is basically to export your data as Parquet files onto something like S3, and then everyone can have their own instance of DuckDB reading those files.

    Now there are also some new cloud data warehouse solutions like Iceberg and Delta Lake that try to build updates and inserts on top of Parquet files, and DuckDB does have extensions that support both Iceberg and Delta Lake to some degree. I think neither of them is currently feature complete, but this is a common thing I see.

    I think for a lot of people just having a database would be good enough. But what makes a lot of people excited about DuckDB today is that you can kind of decouple compute and storage by running these things on something like S3.

    Again, a very common thing I see is people slapping DuckDB into a Lambda function: you have some batch job that produces files on S3 overnight, and then you can query them using DuckDB through a web server on AWS (see the sketch below). Then you have built your own mini data warehouse.
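A hedged sketch of that pattern (the bucket, path and region are placeholders; httpfs is the extension that provides s3:// support, and credentials would normally come from the environment or a CREATE SECRET statement):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# Region is a placeholder; credentials are assumed to come from the environment
con.execute("SET s3_region = 'eu-west-1';")

# Every user can point their own DuckDB instance at the same shared files
con.sql("""
    SELECT count(*)
    FROM read_parquet('s3://my-bucket/exports/*.parquet');
""").show()
```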

  • Why don’t you use DuckDB’s native file format in that case? Why GeoParquet?

    Good question. I think people are really afraid of vendor lock-in: only DuckDB can read the DuckDB file format, while Parquet has implementations in most languages. Some people are interested in using the DuckDB format as a kind of cloud-native format as well, but my personal opinion is that that’s going to be difficult, because the DuckDB format has some very DuckDB-specific things that make it hard to read from a third-party implementation.
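As a rough illustration of that Parquet-first workflow (hedged: the file name is made up, and the geometry column name follows the usual GeoParquet convention rather than a specific dataset), a GeoParquet file stores its geometries as WKB in an ordinary Parquet column, so any Parquet reader can get at the data and the spatial extension can decode it:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

# GeoParquet geometries are WKB blobs; decode them with ST_GeomFromWKB
con.sql("""
    SELECT ST_AsText(ST_GeomFromWKB(geometry)) AS wkt
    FROM read_parquet('buildings.parquet')
    LIMIT 5;
""").show()
```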