
High-performance Spatial Data Management and Analysis with DuckDB

Join us for the Big Geodata Talk on DuckDB Spatial!

DuckDB is a novel in-process SQL database system designed for analytical workloads that has been making waves in the data science and engineering community, not only for its impressive performance, but also for its focus on ease of use and its integrations with the wider data ecosystem. A key part in making this possible is DuckDB's flexible extension system, which enables DuckDB to be used across different domains while the core system itself remains small and focused. One such extension is the DuckDB Spatial Extension, which brings geospatial data processing capabilities to DuckDB, allowing users to perform complex spatial queries and transformations. By incorporating the trifecta of foundational open source GIS libraries (GDAL, GEOS and PROJ) as well as natively implemented geospatial algorithms, all neatly packaged into a single binary with no runtime dependencies, the spatial extension provides hundreds of familiar spatial SQL functions and import and export capabilities for dozens of different vector file formats.
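As a minimal sketch of what this looks like in practice (using the duckdb Python package; the file name, the attribute values and the default geometry column name are assumptions for illustration), installing and loading the extension takes two statements, after which spatial SQL and GDAL-backed import become available:

```python
import duckdb

con = duckdb.connect()            # in-process, in-memory database
con.execute("INSTALL spatial;")   # fetch the extension once
con.execute("LOAD spatial;")      # load it into this session

# Import a vector file through the GDAL-backed ST_Read table function
# ('buildings.geojson' is a hypothetical file)
con.execute("""
    CREATE TABLE buildings AS
    SELECT * FROM ST_Read('buildings.geojson');
""")

# Use familiar spatial SQL on the imported data
print(con.sql("""
    SELECT ST_Area(geom) AS area
    FROM buildings
    ORDER BY area DESC
    LIMIT 5;
""").fetchall())
```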

Just like DuckDB tries to default to the behavior of PostgreSQL, the spatial extension is heavily inspired by PostGIS and similarly follows the Simple Features SQL standard. However, while the Simple Features geometry model undoubtedly provides a great deal of flexibility with its hierarchy of subtypes (points, linestrings, multipolygons) and optional Z and M dimensions, it is not always the most efficient representation for modern high-performance processing. The spatial extension therefore implements a number of geospatial algorithms natively to make the most of DuckDB's vectorized execution engine and memory model, and it also complements the GEOMETRY type that we all know and love with a new set of strongly typed spatial types backed by a columnar storage model, similar to what is being proposed in the GeoArrow project. This makes DuckDB's spatial extension an exciting project, as it stands with one foot firmly in the traditional open source GIS world and the other in the modern data science and engineering movement.
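To give a rough feel for the difference, here is a hedged sketch (the exact set of specialized types and the casts available between them and GEOMETRY may vary between versions of the extension) showing the flexible GEOMETRY type next to the column-backed POINT_2D type:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

# A classic Simple Features style GEOMETRY value...
con.sql("SELECT ST_AsText(ST_GeomFromText('POINT (1 2)')) AS wkt").show()

# ...and the same point as a strongly typed, columnar POINT_2D,
# assuming the GEOMETRY -> POINT_2D cast provided by the extension
con.sql("SELECT ST_GeomFromText('POINT (1 2)')::POINT_2D AS pt").show()
```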

In this talk, we will introduce DuckDB and the DuckDB Spatial Extension, walk through some of the internals that make DuckDB special as well as some of the challenges and design decisions encountered when adapting it for geospatial processing. We will also showcase some of the main features the spatial extension brings to the table today and share some insights into the future of the project.

Date

31 May 2024, 11:00-12:00 CET

Venue

ITC Langezijds Building, Room LA 1212
Hallenweg 8, 7522 NH Enschede

or

Online

Registration

Registration is closed. You can participate in the event online at https://itc.nl/live.

Speaker

Max Gabrielsson
Software Engineer, DuckDB Labs

Max Gabrielsson is a software engineer at DuckDB Labs where he works on the DuckDB database system in general and is the primary developer of the spatial extension. Previously he's held a variety of roles at different small-scale startups and co-founded a company selling maps, which sparked his interest in geospatial data management and processing. Max holds a BSc in Computer Science from Uppsala University.

Video

Presentation

Questions and Answers

  • Many people approach a geospatial database with a GIS client for mapping purposes. Is this possible with DuckDB, and how would you set that up?

    Yes, there are actually two things in this space. There is a QGIS plug-in developed by a third party. I haven’t used it myself, but it apparently offloads some heavy computations to DuckDB. There is also a DBeaver plugin for DuckDB that has spatial support, but it doesn’t offer full GIS functionality. In the future, that is something I want to make easier, making DuckDB more spatially interoperable.

  • You spoke about a possible implementation of geometries using structs, but then you end up with "records" again at that point, don't you? Did I misinterpret it?

    DuckDB structs are a little different, because you don’t get the whole record in one row. You basically get separate columns for each field, so it’s kind of like a struct with an array layout (see the small illustration below). Internally it looks very similar to just having another table nested inside the column.
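A small illustration of that idea using DuckDB's ordinary STRUCT type (this is not the spatial extension's internal code, just a sketch of the struct-of-arrays layout being described): each field of a struct column is stored as its own child vector, and fields can be addressed like nested columns:

```python
import duckdb

con = duckdb.connect()

# A column of structs: logically one value per row, physically one
# child vector per field (x and y), i.e. a struct-of-arrays layout
con.sql("""
    WITH pts AS (
        SELECT {'x': 1.0, 'y': 2.0} AS pt
        UNION ALL
        SELECT {'x': 3.0, 'y': 4.0}
    )
    SELECT pt, pt.x, pt.y FROM pts;
""").show()
```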

  • You say you've implemented 100 plus spatial functions. In PostGIS last time I checked, there are 1000 plus or so. Do you expect more complications as you go?

    Yes, that's a good question. I think initially, when I first published it, I kind of wanted to translate everything that was provided. Then it kind of caught up with me, and we had to rethink how the geometry model is represented internally. So there's been a lot of refactoring over the last six months, but I hope we can now get back to adding features.

  • Do you have optimizations specific to geospatial functions in place? How does it work?

    We have two things going on. One is pushing down predicates: especially for read functions, when you read from GDAL and also need to filter on some geometry predicate, such as which zones intersect, you can push that comparison down into GDAL, which for some formats and drivers can be executed more efficiently there.

    The main optimization that we have in place is for spatial joins. Normally, when you do a join like this, you get what’s called a nested-loop join, which basically compares everything with everything, so it has n x m complexity. That’s really not great and is pretty expensive. What we do in the serialized format is cache the bounding box of the geometry at the beginning. There we sacrifice some write speed, since we have to compute the bounding box on every update and insert, but in DuckDB’s case that makes sense, because in an analytical database you are probably going to read much more than you write. Then, having the bounding box, when you do a join, at least an inner join, on a spatial predicate, we rewrite it into a range join on the bounding boxes first and move the exact predicate out as an extra filter on top (sketched below, after this answer). This sounds complicated, but DuckDB has a really efficient range-join operator.

    We haven’t implemented a proper spatial join yet, but it is something that I really look forward to getting started with. DuckDB itself has a really good hash join, but yes, the spatial counterpart doesn’t match up as well yet.
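A hedged, hand-written sketch of the rewrite described above (the real optimization happens inside DuckDB's optimizer using the bounding box cached in the serialized geometry; here the bounding boxes are recomputed with ST_XMin/ST_XMax/ST_YMin/ST_YMax, and the tables are made up for illustration):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

# Two toy tables of polygons
con.execute("""
    CREATE TABLE a AS SELECT 1 AS id,
        ST_GeomFromText('POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0))') AS geom;
""")
con.execute("""
    CREATE TABLE b AS SELECT 1 AS id,
        ST_GeomFromText('POLYGON ((1 1, 3 1, 3 3, 1 3, 1 1))') AS geom;
""")

# Naive spatial join: a nested-loop join, evaluating the predicate n x m times
con.sql("""
    SELECT a.id AS a_id, b.id AS b_id
    FROM a JOIN b ON ST_Intersects(a.geom, b.geom);
""").show()

# Conceptually what the rewrite produces: a range join on the bounding
# boxes first, with the exact predicate kept as a filter on the survivors
con.sql("""
    SELECT a.id AS a_id, b.id AS b_id
    FROM a JOIN b
      ON  ST_XMin(a.geom) <= ST_XMax(b.geom)
      AND ST_XMax(a.geom) >= ST_XMin(b.geom)
      AND ST_YMin(a.geom) <= ST_YMax(b.geom)
      AND ST_YMax(a.geom) >= ST_YMin(b.geom)
    WHERE ST_Intersects(a.geom, b.geom);
""").show()
```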

  • Introducing new datatypes makes sense, but do you think it can be done in a way that is hidden from the user? Because the user doesn't need to know these implementation details.

    Yes, in general it is pretty transparent. These new geometry types are always implicitly castable to the corresponding GEOMETRY type. So the idea is that if you want to use functions that are implemented for the specialized types, you get those optimized functions; otherwise an implicit cast is performed and the other available functions are used. This cast is not very expensive.

    I tried to make it a little friendlier for the user, but I do think at some point you need some way to differentiate between the implementations, and I think doing that at the type level is probably the best way to do it.

  • If you are going to host data in the cloud and have multiple users accessing it at the same time, how is that possible?

    DuckDB can’t do writes remotely, but it can do reads remotely, even for the DuckDB file format itself. A really common way I see DuckDB being used is basically to export your data as Parquet files onto something like S3, and then everyone can have their own instance of DuckDB reading those files.

    Now there are also some new cloud data warehouse solutions like Iceberg and Delta Lake that try to build updates and inserts on top of Parquet files, and DuckDB does have extensions that support both Iceberg and Delta Lake to some degree. I think neither of them is currently feature complete, but this is a common thing I see.

    I think for a lot of people just having a database would be good enough. But what makes a lot of people excited about DuckDB today is that you can kind of decouple compute and storage by running these things on something like S3.

    Again, a very common thing I see is people slapping DuckDB into a Lambda function: you have some batch job that produces files on S3 overnight, and then you can query them using DuckDB through a web server on AWS (see the sketch below). Then you have built your own mini data warehouse.
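A hedged sketch of that pattern (the bucket, path and region are placeholders; httpfs is the extension that provides s3:// support, and credentials would normally come from the environment or a CREATE SECRET statement):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# Region is a placeholder; credentials are assumed to come from the environment
con.execute("SET s3_region = 'eu-west-1';")

# Every user can point their own DuckDB instance at the same shared files
con.sql("""
    SELECT count(*)
    FROM read_parquet('s3://my-bucket/exports/*.parquet');
""").show()
```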

  • Why don’t you use DuckDB’s native file format in that case? Why GeoParquet?

    Good question. I think people are really afraid of vendor lock-in: only DuckDB can read the DuckDB file format, while Parquet has implementations in most languages. Some people are interested in using the DuckDB format as a kind of cloud-native format as well, but my personal opinion is that that’s going to be difficult, because the DuckDB format has some very DuckDB-specific things that make it hard to read from a third-party implementation.
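As a rough illustration of that Parquet-first workflow (hedged: the file name is made up, and the geometry column name follows the usual GeoParquet convention rather than a specific dataset), a GeoParquet file stores its geometries as WKB in an ordinary Parquet column, so any Parquet reader can get at the data and the spatial extension can decode it:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

# GeoParquet geometries are WKB blobs; decode them with ST_GeomFromWKB
con.sql("""
    SELECT ST_AsText(ST_GeomFromWKB(geometry)) AS wkt
    FROM read_parquet('buildings.parquet')
    LIMIT 5;
""").show()
```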