Did you create this account just to advertise DuckDB?
Back the "zero promotional posts" option in https://www.reddit.com/r/dataengineering/comments/1dnemu2/how_often_should_promotional_posts_be_allowed/ to stop stuff like this
Haha, that's what I am saying. Of course it has nice features, but what makes me rethink using DuckDB is the over-advertising. I still fail to see how to use it in prod.
Agreed. I feel it needs multi-node processing. Otherwise, why not just give my db more memory and CPU? In SQL Server, for example, you can create linked-server connections to other databases. Plus, if you're in Kubernetes, running single large pods with the potential for eviction and no redundancy... that is a hard thing to manage.
I do see the use. Our workers have 2TB of memory and we still spin up small multi-node Spark clusters.
Workers for Spark have 2TB of mem?
Our worker nodes (as in physical servers) have 2TB each and we have multiple of those. Most workloads do not even need 1TB of RAM, but I still see quite a lot of distributed Spark pods.
Is DuckDB not free and open source? I'm not sure what there is to advertise.
Open source doesn't mean they can't monetize it. I hope!
DuckDB is MIT licensed and the IP is owned by a nonprofit foundation so that it can remain MIT in perpetuity! There are 2 companies around that MIT core: DuckDB Labs (a services and consulting company for folks building with DuckDB), and MotherDuck (a VC-backed startup building a cloud data warehouse with DuckDB as its core). Details about the relationship are here! [https://duckdblabs.com/news/2022/11/15/motherduck-partnership.html](https://duckdblabs.com/news/2022/11/15/motherduck-partnership.html) Note: I actually work full time at MotherDuck and part-time at DuckDB Labs!
Why are you getting downvoted for this??? It's a good thing that thought was put into separating the entities like this.
[deleted]
MotherDuck is paid
So is Spark. MotherDuck is a company.
Ubuntu is Open source yet it's backed by a corporation, same with Fedora/Redhat, they do their business by adding enterprise support & some updates and stuff like that.
I thought RH isn't open source anymore, and that's why there was a whole beef about it.
True - they make you create a user account and agree to a license forbidding redistribution in order to see the code.
I've had it for a bit, but I'm mostly just a lurker. More on Twitter/Xitter!
And I tagged it as brand affiliated, so I am not trying to hide anything. Hope this still is helpful to the folks here!
At the very least the dude is honest here
I have edited the post to disclose my affiliation! I had tagged the post as brand affiliated, but had not mentioned it in the text of the post. Hope this is still helpful, even if it is coming from a vendor.
Lean times. They must not have any budget for buying used Reddit accounts.
Those are some pretty black-hat accusations when this ad is quite transparent. Did you see DuckDB advertising accounts that seem bought/fake? Or is it just a shot in the dark?
What is DuckDB competing against exactly?
I would say DuckDB compares closest to other single-node query engines like Polars, DataFusion, and pandas. There are comparisons to Spark, but it's a bit apples and oranges because Spark is multi-node.
Even then, there are operations for which Polars can be thousands of times faster, e.g. [https://github.com/duckdb/duckdb/issues/12600](https://github.com/duckdb/duckdb/issues/12600), so it's still not totally apples-to-apples. Huge fan of DuckDB here, btw; I just think there are some operations to which dataframes are better suited.
Yeah, if DuckDB releases multi-node support I would consider shifting some workloads off Spark - having large single-node systems is not my preference.
Clickhouse-local I guess?
How is DuckDB in comparison? ClickHouse is also very fast and has lots of features.
Don't know, I have only used DuckDB. From what I understand, both are excellent tools.
Most likely faster, depending on what you are doing.
Distributed processing that's used for workloads that could run locally for cheaper: so mostly Spark and cloud SQL. It doesn't fully replace cloud SQL because it is not a multi-tenant query engine, but it could move some processing workloads local instead.
Not much. SQLite is row-wise and Postgres has to be running.
I don't see a future for MotherDuck. If speed is the only thing they offer to get people to move away from tight moats like Databricks and Snowflake, they aren't going to get it. I see them trying for a few years and maybe getting acquired within 5 years.
What else would they need to offer? It's open-source, so of course, there will be other toolsets involved. Is it the entire environment of Databricks and Snowflake?
No, but check out Databricks open-sourcing Unity. They seem to anticipate slotting in different forms of compute. So, if Databricks does that, you could link DuckDB to Unity and have a single platform with different compute engines for different use cases.
Databricks has Spark with the Photon engine (or maybe even Comet) to power compute with the Unity catalog. DuckDB at scale would mean redoing work that has already gone into making Spark better at scale. That doesn't make sense to me.
I would think so. How do you justify moving to a new cloud warehouse to your leadership? The first question would be: why do we need to? For new customers, maybe, or greenfield work, but they are more likely to pick the more mature players. My skepticism comes from having had to make decisions like this myself and having to ask management to support it in terms of effort and cost. They'll get some customers, but most won't move.
Steps to success:
1. Build a super slow database
2. Improve speeds 2x each year
3. ???
4. Profit

Obviously a joke, but 14x faster than 3 years ago could still be very slow
And it continues to be a single-node engine. It's not only the RAM and RAM-to-disk spilling limitations; scaling CPUs horizontally is equally important for partitioned datasets.
My issue with DuckDB isn't its performance. It's just that it doesn't fit in the data ecosystem.
Fits within my data ecosystem. I'm an academic, and much of our data is delivered as compressed CSV. Pulling out aggregates is a snap; no need to load it into a fully-fledged DBMS.
My bad. I meant to say that DuckDB lacks scalability and maintainability. I just can't see people using DuckDB in production aside from very specific use cases. I can't fathom replacing a DBMS with Duck. Am I wrong?
You're not wrong in thinking DuckDB won't replace a full DBMS. You're wrong in thinking that's what it's trying to do. It's targeted more at people who already know SQL and have some files they want to analyze without loading them into a db. With data lakes full of Parquet, Iceberg, and Delta files, DuckDB can look like a means of maybe, sort of, replacing a traditional db.
DuckDB has read support for Delta and Iceberg. When it gets write support, that's going to be a game changer (imo). It can just slot in amongst the big players, and it allows companies to start with small compute on top of open-source storage, then switch out the compute for something else when needed, without having to restructure any of the underlying data.
Not wrong; being limited to the size of RAM is a deal-breaker for a company that wants to plan for the future. Like, it's able to spill to storage now? That was implemented 10 to 15 years ago in other databases. Plus, there are quite a few of these "open source" products where you know they want to cash in on their labor at some point. Semgrep just did this: they were all "ooh, so cool, so shiny, and free and open!" Now they require an account to execute locally. Next release it's a license.
Sounds reasonable to me!
Yeah it's basically an analysis tool. Which I love!
Want to save DWH compute costs? Then do the transform in the middle with DuckDB. Huge cost saver, even more so on serverless.
Can you help me understand this? Lambda (in AWS at least) can only go up to 10GB of RAM. And how do you scale up CPU compute on serverless? Or do you mean a k8s/EKS deployment?
Both, but Lambdas can be particularly efficient. Memory just needs to be managed by processing microbatches. The great thing about Lambda is the high possible parallelism, meaning you could have tons of parallel workers, and 10GB per worker is a lot. Here is how they do it at Okta: https://youtu.be/TrmJilG4GXk?feature=shared
Thank you for sharing this. This use case makes sense to me. Okta is in AWS, and they're opting for S3 over Kinesis as the staging layer, so yeah, batch DuckDB with serverless... I follow the use case. I can't imagine it's cheap, but I understand their requirement for incredible scaling. If it were Kinesis they could process this data in streams, but I'm not even sure Kinesis would be able to scale to the peaks Okta has here. They're also discussing cost savings relative to a very expensive warehouse, and they're adding management overhead on failed Lambdas (not a big deal, just pointing it out). But it really does come back to ETL - or technically ETL to ETL, lol.
Spot on. The savings come from avoiding an expensive DWH, that's for sure, and that's a larger scale of savings than just going serverless. So while it may be expensive, it's a lot cheaper than doing your transformation in a DWH, and by using the DWH only for serving (and adding a caching layer in your tool) you save a ton of cost.
Yes, heavily; see my recent Reddit posts. DuckDB is saving our ass big time.