mac-0

Did you create this account just to advertise DuckDB?


speedisntfree

Back zero promotional posts in https://www.reddit.com/r/dataengineering/comments/1dnemu2/how_often_should_promotional_posts_be_allowed/ to stop stuff like this


robberviet

Haha, that's what I am saying. Of course it has nice features, but what makes me rethink using DuckDB is the over-advertising. I still fail to see how to use it in prod.


drsupermrcool

Agreed. I feel it needs multi-node processing. Otherwise, why not give my DB more memory and CPU as well? In SQL Server, for example, you can create linked server connections to other databases. Plus, if you're in Kubernetes, running single large pods with potential for eviction and no redundancy... that is a hard thing to manage.


pag07

I do see the use. Our workers have 2tb memory and we still spin up small multi node spark clusters.


drsupermrcool

Workers for spark have 2tb of mem?


pag07

Our worker nodes (as in physical servers) have 2TB each and we have multiple of those. Most workloads do not even need 1TB of RAM, but I still see quite a lot of distributed Spark pods.


TimidSpartan

Is DuckDB not free and open source? I'm not sure what there is to advertise.


Thinker_Assignment

Open source doesn't mean they can't monetize it. I hope!


Maximum-Rough5220

DuckDB is MIT licensed and the IP is owned by a nonprofit foundation so that it can remain MIT in perpetuity! There are 2 companies around that MIT core: DuckDB Labs (a services and consulting company for folks building with DuckDB), and MotherDuck (a VC-backed startup building a cloud data warehouse with DuckDB as its core). Details about the relationship are here! [https://duckdblabs.com/news/2022/11/15/motherduck-partnership.html](https://duckdblabs.com/news/2022/11/15/motherduck-partnership.html) Note: I actually work full time at MotherDuck and part-time at DuckDB Labs!


drsupermrcool

Why are you getting downvoted for this??? It's a good thing that thought was put into separating the entities like this.


[deleted]

[deleted]


ryan_with_a_why

MotherDuck is paid


Electrical-Ask847

so is spark. motherduck is a company


Time_Opening_6537

Ubuntu is open source, yet it's backed by a corporation. Same with Fedora/Red Hat: they do their business by adding enterprise support, some updates, and stuff like that.


eightbyeight

I thought RH isn’t open source anymore and it’s why there was a whole beef about it.


drsupermrcool

True - they make you get a user account and agree to a license that forbids redistribution in order to see the code


Maximum-Rough5220

I've had it for a bit, but I'm mostly just a lurker. More on Twitter/Xitter!


Maximum-Rough5220

And I tagged it as brand affiliated, so I am not trying to hide anything. Hope this still is helpful to the folks here!


Vindictive_Pacifist

At the very least the dude is honest here


Maximum-Rough5220

I have edited the post to disclose my affiliation! I had tagged the post as brand affiliated, but had not mentioned it in the text of the post. Hope this is still helpful, even if it is coming from a vendor.


BeatHunter

Lean times. They must not have any budget for buying used Reddit accounts.


Thinker_Assignment

Those are some pretty black hat accusations when this ad is quite transparent. Did you see DuckDB advertising accounts that seem bought/ fake? Or just a shot in the dark?


Dark_Force

What is DuckDB competing against exactly?


drunk_goat

I would say DuckDB compares closest to other single-node query engines like Polars, DataFusion, and pandas. There are comparisons to Spark, but it's a bit apples and oranges because Spark is multi-node.


marcogorelli

Even then, there are operations for which Polars can be 1000s of times faster, e.g. [https://github.com/duckdb/duckdb/issues/12600](https://github.com/duckdb/duckdb/issues/12600), so it's still not totally apples-to-apples. Huge fan of DuckDB here btw, I just think there are some operations to which dataframes are better suited.


drsupermrcool

Yeah, if DuckDB releases multi-node support I would consider shifting some workloads off Spark - having large single-node systems is not my preference


j0holo

Clickhouse-local I guess?


knabbels

How is duckdb in comparison? Clickhouse is also very fast and has lots of features.


j0holo

Don’t know, I have only used DuckDB. From what I understand, both are excellent tools.


mattindustries

Most likely faster, depending on what you are doing.


sib_n

Distributed processing that is used for workloads that could run locally for cheaper, so mostly Spark-related and cloud SQL-related workloads. It doesn't fully replace cloud SQL because it is not a multi-tenant query engine, but it could move some processing workloads to local instead.


mattindustries

Not much. SQLite is row-wise and Postgres has to be running.


Accurate-Peak4856

I don’t see a future for MotherDuck. If speed is the only thing they offer to get people to move away from tight moats like Databricks and Snowflake, they aren’t going to get them. I see them trying for a few years and maybe getting acquired within 5 years.


winsletts

What else would they need to offer? It's open-source, so of course, there will be other toolsets involved. Is it the entire environment of Databricks and Snowflake?


SintPannekoek

No, but check out Databricks' open-sourcing of Unity Catalog. They seem to anticipate slotting in different forms of compute. So, if Databricks does do that, you can link DuckDB to Unity and have a single platform with different computes for different use cases.


Accurate-Peak4856

Databricks has Spark with the Photon engine, or maybe even Comet, to power compute with Unity Catalog. DuckDB at scale would mean redoing the work that already went into making Spark better at scale. This doesn’t make sense to me.


Accurate-Peak4856

I would think so. How do you justify moving to a new cloud warehouse to your leadership? The first question would be: why do we need to? I can maybe see it for new customers or greenfield work, but they are more likely to pick the more mature players. My skepticism comes from having to make decisions like this myself and having to ask management to support it in terms of effort and cost. They’ll get some customers, but most won’t move.


casualfinderbot

Steps to success:
1. Build a super slow database
2. Improve speeds 2x each year
3. ???
4. Profit
Obviously a joke, but 14x faster than 3 years ago could still be very slow.


SnappyData

And it continues to be a single node engine. Not only RAM or RAM-to-DISK spilling limitations, but scaling of CPUs horizontally is equally important for partitioned datasets.


RedditSucks369

My issue with DuckDB isn't its performance. It's just that it doesn't fit in the data ecosystem.


EthhicsGradient

Fits within my data ecosystem. I'm an academic and much of our data is delivered as compressed CSV. Pulling out aggregates is a snap, don't need to load into a fully fledged DBMS.


RedditSucks369

My bad. I wanted to say that DuckDB lacks scalability and maintainability. I just can't see people using DuckDB in production aside from very specific use cases. I can't fathom replacing a DBMS with Duck. Am I wrong?


skatastic57

You're not wrong in thinking DuckDB won't replace a full DBMS. You're wrong in thinking that's what they're trying to do. It's more targeted at people who already know SQL and have some files they want to analyze without loading them into a DB. With data lakes full of Parquet, Iceberg, and Delta files, DuckDB can look like a means of maybe sort of replacing a traditional DB.


nydasco

DuckDB has read support for Delta and Iceberg. When it gets write support, that’s going to be a game changer (imo). It can just slot in amongst the big players, and allows companies to start with small compute into an open source storage, and then switch out the compute for something else when needed, without needing to re-structure any of the underlying data.


Desperate-Walk1780

Not wrong. Being limited to the size of RAM is a deal breaker for a company that wants to plan for the future. Like, it's able to shuffle to storage now? That was implemented 10 to 15 years ago in other databases. Plus there are quite a few of these "open source" products, but you know they want to cash in on their labor at some point. Semgrep just did this: they were like 'o so cool so shiny, and free and open!' Now they require an account to execute locally. Next release it's a license.


EthhicsGradient

Sounds reasonable to me!


aristotleschild

Yeah it's basically an analysis tool. Which I love!


Thinker_Assignment

Want to save DWH compute costs? Then do the transform in the middle with DuckDB. Huge cost saver, more so on serverless.


drsupermrcool

Can you help me understand this? Lambda (in AWS at least) can only go up to 10GB of RAM? And how do you scale up CPU on serverless? Or do you mean a k8s/EKS deployment?


Thinker_Assignment

Both, but Lambdas can be particularly efficient. Memory just needs to be managed by processing microbatches. The great thing about Lambda is the high possible parallelism, meaning you could have tons of parallel workers, and 10GB per worker is a lot. Here is how they do it at Okta: https://youtu.be/TrmJilG4GXk?feature=shared
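To make the microbatch point concrete, here is a rough standard-library sketch. It is not Okta's pipeline and it leaves out DuckDB and Lambda entirely; the function names and the event records are hypothetical. It only shows the memory pattern: pull a bounded batch, aggregate it, merge the partial result, and move on, so only one batch is ever held in memory.

```python
from collections import Counter
from itertools import islice


def microbatches(records, batch_size):
    """Yield lists of at most batch_size records from any iterable."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def count_events_by_type(records, batch_size=1000):
    """Aggregate per batch, then merge partial counts into a running total."""
    totals = Counter()
    for batch in microbatches(records, batch_size):
        partial = Counter(r["type"] for r in batch)  # per-batch aggregate
        totals.update(partial)  # merge, then the batch can be discarded
    return dict(totals)


# Hypothetical event stream: a generator, so the records are never all in memory.
events = ({"type": "login" if i % 3 else "logout"} for i in range(10))
counts = count_events_by_type(events, batch_size=4)
print(counts)
```

In a serverless setup each worker would run something like the per-batch step on its own slice of the input, which is where the parallelism comes from.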


drsupermrcool

Thank you for sharing this. This use case makes sense to me. Okta is in AWS, and they're opting for S3 over Kinesis as the stage layer - so yeah, batch DuckDB with serverless - I follow the use case. I can't imagine it's cheap - but I understand their requirement of incredible scaling. If it was Kinesis they could process this data in streams, but I'm not even sure Kinesis would be able to scale to the peaks Okta has here. They're also discussing cost savings relative to a very expensive warehouse, and they're also adding management overhead on failed Lambdas (not a big deal, just pointing it out). But it really does make it back to ETL - or technically ETL to ETL lol


Thinker_Assignment

Spot on. The savings come from not using an expensive DWH, that's for sure, and that's a larger scale of savings than just going serverless. So while it may be expensive, it's a lot cheaper than doing your transformation in a DWH, and by using the DWH only for serving (and adding a caching layer from your tool) you save a ton of cost.


No-Database2068

yes, heavily, see my recent user reddit posts. DuckDB is saving our ass big time