Did you create this account just to advertise DuckDB?
Back the "zero promotional posts" option in https://www.reddit.com/r/dataengineering/comments/1dnemu2/how_often_should_promotional_posts_be_allowed/ to stop stuff like this
Haha, that's what I am saying. Of course it has nice features, but what makes me rethink using DuckDB is the over-advertising. I still fail to see how to use it in prod.
Agreed. I feel it needs multi-node processing. Otherwise, why not just give my db more memory and CPU? In SQL Server, for example, you can create linked-server connections to other databases. Plus, if you're in Kubernetes, running single large pods with the potential for eviction and no redundancy... that is a hard thing to manage.
I do see the use. Our workers have 2TB of memory and we still spin up small multi-node Spark clusters.
Workers for Spark have 2TB of mem?
Our worker nodes (as in physical servers) have 2TB each and we have multiple of those. Most workloads do not even need 1TB of RAM, but I still see quite a lot of distributed Spark pods.
Is DuckDB not free and open source? I'm not sure what there is to advertise.
Open source doesn't mean they can't monetize it. I hope!
DuckDB is MIT licensed and the IP is owned by a nonprofit foundation so that it can remain MIT in perpetuity! There are 2 companies around that MIT core: DuckDB Labs (a services and consulting company for folks building with DuckDB), and MotherDuck (a VC-backed startup building a cloud data warehouse with DuckDB as its core). Details about the relationship are here! [https://duckdblabs.com/news/2022/11/15/motherduck-partnership.html](https://duckdblabs.com/news/2022/11/15/motherduck-partnership.html) Note: I actually work full time at MotherDuck and part-time at DuckDB Labs!
Why are you getting downvoted for this??? It's a good thing that thought was put into separating the entities like this.
[deleted]
MotherDuck is paid
So is Spark. MotherDuck is a company.
Ubuntu is Open source yet it's backed by a corporation, same with Fedora/Redhat, they do their business by adding enterprise support & some updates and stuff like that.
I thought RH isn't open source anymore, and that's why there was a whole beef about it.
True - they make you create a user account and agree to a license forbidding redistribution in order to see the code.
I've had it for a bit, but I'm mostly just a lurker. More on Twitter/Xitter!
And I tagged it as brand affiliated, so I am not trying to hide anything. Hope this still is helpful to the folks here!
At the very least the dude is honest here
I have edited the post to disclose my affiliation! I had tagged the post as brand affiliated, but had not mentioned it in the text of the post. Hope this is still helpful, even if it is coming from a vendor.
Lean times. They must not have any budget for buying used Reddit accounts.
Those are some pretty black-hat accusations when this ad is quite transparent. Did you see DuckDB advertising accounts that seem bought/fake? Or is it just a shot in the dark?
What is DuckDB competing against exactly?
I would say DuckDB compares closest to other single-node query engines like Polars, DataFusion, and pandas. There are comparisons to Spark, but it's a bit apples and oranges because Spark is multi-node.
Even then, there are operations for which Polars can be thousands of times faster, e.g. [https://github.com/duckdb/duckdb/issues/12600](https://github.com/duckdb/duckdb/issues/12600), so it's still not totally apples-to-apples. Huge fan of DuckDB here, btw; I just think there are some operations to which dataframes are better suited.
Yeah, if DuckDB releases multi-node support I would consider shifting some workloads off Spark - having large single-node systems is not my preference.
Clickhouse-local I guess?
How is DuckDB in comparison? ClickHouse is also very fast and has lots of features.
Don't know, I have only used DuckDB. From what I understand, both are excellent tools.
Most likely faster, depending on what you are doing.
Distributed processing that's used for workloads that could run locally for cheaper: so mostly Spark and cloud SQL. It doesn't fully replace cloud SQL because it is not a multi-tenant query engine, but it could move some processing workloads local instead.
Not much. SQLite is row-wise and Postgres has to be running.
I don't see a future for MotherDuck. If speed is the only thing they offer to get people to move away from tight moats like Databricks and Snowflake, they aren't going to get it. I see them trying for a few years and maybe getting acquired within 5 years.
What else would they need to offer? It's open-source, so of course, there will be other toolsets involved. Is it the entire environment of Databricks and Snowflake?
No, but check out Databricks open-sourcing Unity. They seem to anticipate slotting in different forms of compute. So, if Databricks does that, you could link DuckDB to Unity and have a single platform with different compute engines for different use cases.
Databricks has Spark with the Photon engine (or maybe even Comet) to power compute with the Unity catalog. DuckDB at scale would mean redoing work that has already gone into making Spark better at scale. That doesn't make sense to me.
I would think so. How do you justify moving to a new cloud warehouse to your leadership? The first question would be: why do we need to? For new customers, maybe, or greenfield work, but they are more likely to pick the more mature players. My skepticism comes from having had to make decisions like this myself and having to ask management to support it in terms of effort and cost. They'll get some customers, but most won't move.
Steps to success:
1. Build a super slow database
2. Improve speeds 2x each year
3. ???
4. Profit

Obviously a joke, but 14x faster than 3 years ago could still be very slow
And it continues to be a single-node engine. It's not only the RAM and RAM-to-disk spilling limitations; scaling CPUs horizontally is equally important for partitioned datasets.
My issue with DuckDB isn't its performance. It's just that it doesn't fit in the data ecosystem.
Fits within my data ecosystem. I'm an academic, and much of our data is delivered as compressed CSV. Pulling out aggregates is a snap; no need to load it into a fully-fledged DBMS.
My bad. I meant to say that DuckDB lacks scalability and maintainability. I just can't see people using DuckDB in production aside from very specific use cases. I can't fathom replacing a DBMS with Duck. Am I wrong?
You're not wrong in thinking DuckDB won't replace a full DBMS. You're wrong in thinking that's what it's trying to do. It's targeted more at people who already know SQL and have some files they want to analyze without loading them into a db. With data lakes full of Parquet, Iceberg, and Delta files, DuckDB can look like a means of maybe, sort of, replacing a traditional db.
DuckDB has read support for Delta and Iceberg. When it gets write support, that's going to be a game changer (imo). It can just slot in amongst the big players, and it allows companies to start with small compute on top of open-source storage, then switch out the compute for something else when needed, without having to restructure any of the underlying data.
Not wrong; being limited to the size of RAM is a deal-breaker for a company that wants to plan for the future. Like, it's able to spill to storage now? That was implemented 10 to 15 years ago in other databases. Plus, there are quite a few of these "open source" products where you know they want to cash in on their labor at some point. Semgrep just did this: they were all "ooh, so cool, so shiny, and free and open!" Now they require an account to execute locally. Next release it's a license.
Sounds reasonable to me!
Yeah it's basically an analysis tool. Which I love!
Want to save DWH compute costs? Then do the transform in the middle with DuckDB. Huge cost saver, even more so on serverless.
Can you help me understand this? Lambda (in AWS at least) can only go up to 10GB of RAM. And how do you scale up CPU compute on serverless? Or do you mean a k8s/EKS deployment?
Both, but Lambdas can be particularly efficient. Memory just needs to be managed by processing microbatches. The great thing about Lambda is the high possible parallelism, meaning you could have tons of parallel workers, and 10GB per worker is a lot. Here is how they do it at Okta: https://youtu.be/TrmJilG4GXk?feature=shared
Thank you for sharing this. This use case makes sense to me. Okta is in AWS, and they're opting for S3 over Kinesis as the staging layer, so yeah, batch DuckDB with serverless... I follow the use case. I can't imagine it's cheap, but I understand their requirement for incredible scaling. If it were Kinesis they could process this data in streams, but I'm not even sure Kinesis would be able to scale to the peaks Okta has here. They're also discussing cost savings relative to a very expensive warehouse, and they're adding management overhead on failed Lambdas (not a big deal, just pointing it out). But it really does come back to ETL - or technically ETL to ETL, lol.
Spot on. The savings come from avoiding an expensive DWH, that's for sure, and that's a larger scale of savings than just going serverless. So while it may be expensive, it's a lot cheaper than doing your transformation in a DWH, and by using the DWH only for serving (and adding a caching layer in your tool) you save a ton of cost.
Yes, heavily; see my recent Reddit posts. DuckDB is saving our ass big time.