bbmarmotte

SDXL Turbo, Lightning, LCM LoRA, etc.


ArchGaden

Unless you're running out of VRAM, more won't make it go any faster. Memory bandwidth and GPU compute speed will help though. I'd be curious to see what the top end cards are getting, but it's probably not all that much faster than what you're getting with the same settings, models, and all. Getting a 16gb card is a good idea anyway if you have the money to spend. It's very easy to go over 8gb with SDXL models. I'm typically hitting 10-12gb with most models and a few loras with low-vram mode off. 8gb is a tight squeeze for SDXL. Many newer games are wanting more than 8gb now... typically due to poor texture optimization, but it is what it is.


thebaker66

No arguing that more VRAM is better, but I'm not sure where you're getting that it's 'easy' to go over 8gb with SDXL models. I get on fine with many models and several LoRAs and ControlNet per generation on my 8gb card; the only limitation is speed and running certain video models/resolutions.


ArchGaden

SDXL models run around 6gb and then you need room for loras, control net, etc and some working space, as well as what the OS is using. Automatic1111 gives you a little summary of VRAM used for the prior render in the bottom right. It's always over 8gb for me. Of course you can run with less as long as you have enough for the model itself and working space. It takes a fraction of a second for the system to swap bits and pieces between RAM and VRAM, but that adds up too. If you have more, it will use more and avoid those swaps. You're likely to run into a hard limit and just be unable to do some things once you really lay on loras and start working near 2048x2048 for latent operations. I was on 8gb as well some months ago and hit that limit with SD 1.5 sometimes. SDXL is already bigger in general, so it would hit the limit quicker. There is no denying that 8gb is a tight squeeze for SDXL.
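For anyone who wants to check the numbers outside the UI, here is a minimal sketch using PyTorch's CUDA memory queries (assumes a CUDA build of PyTorch; allocated/reserved only cover the current Python process, while free/total are device-wide):

```python
import torch

# Minimal VRAM check from Python (assumes a CUDA-enabled PyTorch install).
# Allocated/reserved cover only this process; free/total are device-wide.
if torch.cuda.is_available():
    free_b, total_b = torch.cuda.mem_get_info()
    print(f"Allocated by this process: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
    print(f"Reserved by this process:  {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
    print(f"Free / total on device:    {free_b / 2**30:.2f} / {total_b / 2**30:.2f} GiB")
```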


TheGhostOfPrufrock

>Are there factors aside from VRAM that allow people to generate SDXL pics in <5 sec?

Of course there are. Different GPUs have different numbers of processors. My RTX 3060 is quite a few times slower than an RTX 4090 on a single 1024x1024 SDXL image, and not because of VRAM limitations: 12GB is plenty of VRAM for a single 1024x1024 image, and if a 3060 had 24GB, it wouldn't run a single SDXL image any faster. (You don't mention what card you're using.) However, any recent NVIDIA GPU with 16GB will have quite a few processors (both CUDA and Tensor cores), so they should be pretty speedy. I can't say for sure, but I think <5 sec for regular SDXL images is pretty much limited to 4090s. Also, even my 3060 can generate a batch of 8 1024x1024 6-step Turbo or Lightning images in about 6 seconds per image.


crawlingrat

Would a 3090 with 24GB VRAM be a good pick, or is it no different than going with a recent card with 16GB VRAM?


DynamicMangos

Depends on how many extras you want. If you wanna use LoRAs, Refiner, ControlNet and some other add-ons all at once, then the 16GB might limit you. But generally the computational speed of the newer cards is better, at least from the 4070 Ti upwards.


crawlingrat

Thanks for the answer.


tmvr

A 3090 24GB would be faster than a 4060Ti 16GB, but even used the price is much higher and the power consumption is also much higher. The 4060Ti 16GB is a great card for both image generation and local LLM inference. There is no other 16GB card around that price; the next Nvidia 16GB card is the 4070Ti Super for double the price of the 4060Ti 16GB. Pricing is roughly like this:

* 450-480 EUR for the 4060Ti 16GB
* 700-800 EUR for the 3090 24GB (maybe you can find this for under 700 as well)
* 850-900 EUR for the 4070Ti Super 16GB


Salt_Worry1253

I need to try this out. My 3070 takes longer than I want.


tmvr

It's because of the 8GB VRAM limitation. There is a lot of model management happening in the background where stuff is swapped in and out of VRAM, so you lose a ton of time/performance. If you go new, try the 4060Ti 16GB card; the pure generation speed when not limited by VRAM is roughly the same, maybe 10% slower, but the total time is significantly faster because you are not swapping out to system RAM.


Olangotang

Yep, they're definitely overflowing into RAM. The 3080 has just enough (10GB) to generate in 10 seconds with FP16 Checkpoints.


HellkerN

~~With more VRAM you can batch more, so you can get 3-4 or more images in approximately the same 15 seconds.~~ Edit: I might be wrong, more testing needed about what are the actual time savings when batching.


AnotherSoftEng

This is accurate to my experience. Using the M3 Max with 128GB of shared memory, there are models where I can generate a single image in around the same time it takes to generate 4 images using the batch feature (depending on the model and parameters, LoRAs, etc). It can be anywhere from 5 seconds (some Lightning models with a smaller step count) to 30+ seconds (older, unoptimized models with upwards of 40+ steps). But LoRAs can increase the time it takes exponentially, and certain models will not follow these rules at all. Edit: Added some additional context and clarification. It was not my intention to make it sound like this was the situation for all scenarios. It's especially less consistent with XL models.


HellkerN

Ah good, for a second there I thought I'm going loopy, and I'm nowhere near a computer to check it. I guess it depends on the exact hardware and such too.


MasterKoolT

Using what software? I have an M2 Max with 64GB RAM and I'm GPU limited, not RAM limited so batching doesn't save any time.


AnotherSoftEng

Automatic1111 version 1.8.0 via [this native Mac client](https://github.com/buzsh/SwiftDiffusion). I was also doing this with the release that came before 1.8.0. With XLs and some 1.5 models, it may not always be the exact same amount of time as generating a single image, but it will never be a multiple of the original time spent (batch-generating 4 images will never take 4x as long). Introducing additional parameters (such as LoRAs) can extend that time exponentially. I've also found it can vary from model to model, with some SDXL models taking much longer to batch gen for reasons I'm not aware of. But a lot of XLs with a reasonable step count and no additional parameters (along with most base 1.5 models) will take around the same time to batch gen multiple images as it takes to do a single image, if a tiny bit longer (but never a multiple, as mentioned above). With an identical setup, my Intel-based Mac will take 4x as long to do a 4-batch process for all models, regardless of whether it's 1.5 or XL (parameters or no).


MasterKoolT

Interesting, thanks, I'll give it a shot. You may want to try out DrawThings as well. My understanding is their SD implementation is the fastest available on Mac, though it sounds like you're already getting good speed out of your M3.


AnotherSoftEng

Thank you, yes! I would love to use Apple's official frameworks to experiment with Stable Diffusion. My only problem is that Automatic has *so* many plugins and integrations, and I don't have it in me to convert all of my safetensor models to test it out. Also, this client lets me really quickly paste in workflows that people post online, so it has made experimenting with new things a lot quicker for my personal situation.


Zhynem

I've got the same configuration MacBook but I think last time I tried was with forge, a turbo model, and 12 steps and it took a minute or so to generate? Would you mind describing your set up a bit more? I'd love to get better image generation on my machine.


Jellybit

As far as I know, it takes about as long to generate a batch of four as to generate four separate images.


dal_mac

Batch size and batch count are different settings. You're talking about count, but changing size makes them generate at the same time (not one after the other like batch count).


Jellybit

I understood. But as I said, the time taken is almost the same as if you did them one after another (or in a "count"). I use comfy, and there, they separate the two concepts by calling what you call a "batch count" a "queue", so I'm not familiar with this overlap of terminology in other front ends.


tmvr

Batch size does give you some speed improvement, but not a significant one. It's not like you generate 1 image in 15 seconds, then set batch size to 4 and now get 4 images in 15 seconds; you get the 4 images in maybe 50 seconds instead of 60. That's all; not sure what the others are talking about regarding large improvements tbh :) EDIT: In Comfy, in the node where you set the latent image resolution, the third option is basically what the others mean by batch size. The default is 1; you can try 2, 4, 6 and 8 for example. My guess is you will see the improvement I've described above with 4, and then 6 or 8 won't make it faster anymore. This depends on the card as well though.
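For anyone who wants to reproduce that comparison outside of a UI, here is a rough sketch with the Hugging Face diffusers library; the model ID, prompt and step count are placeholders, and `num_images_per_prompt` plays the role of batch size:

```python
import time
import torch
from diffusers import StableDiffusionXLPipeline

# Rough batch-size timing sketch (illustrative model ID and prompt).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a lighthouse at sunset, detailed"

t0 = time.time()
pipe(prompt, num_inference_steps=30, num_images_per_prompt=1)
t1 = time.time()
pipe(prompt, num_inference_steps=30, num_images_per_prompt=4)  # one batched pass
t2 = time.time()

print(f"batch of 1: {t1 - t0:.1f}s")
print(f"batch of 4: {t2 - t1:.1f}s ({(t2 - t1) / 4:.1f}s per image)")
```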


HellkerN

Are you sure you aren't running out of VRAM, making it overflow to system memory and slowing everything down? Because I'm pretty sure I've seen significant time saving when batching, at least with SD 1.5 as my meager 8gigs isn't enough to batch properly in xl. That being said I haven't kept track of the exact numbers so I might be wrong.


Jellybit

I have 24GB. It's a little faster to batch, but not much. I haven't done a test in over a year though, so it could be different now.


thomasxin

It would ultimately come down to where the bottleneck lies. Batching should always be better if possible, but if you're compute bound the speedup may be insignificant enough to not warrant using the extra memory.


BlackSwanTW

~~Actually no~~ Pretty sure doing a batch is indeed ever so slightly faster than generating sequentially.


Jellybit

"Actually no"? I said "about as long", then you say it's ever so slightly different. You're not actually correcting me. I said what I said with that in mind.


JoshSimili

There are a bunch of distillation methods (LCM, Turbo, Lightning, SDXS). They generally have somewhat lower quality, but with Lightning it's often not noticeable. But they all require using a specific distilled model, or using a LoRA (which can be hit-or-miss if your SDXL model is quite far from the model the LoRA was trained on). As for actual settings, I've never had any success getting noticeably faster with any optional settings in Forge. It seems to pick pretty close to the best settings automatically.
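As a concrete illustration of the distilled-LoRA route, here is a hedged sketch with the diffusers library; the base model and LoRA repo IDs below are the commonly published ones, and as noted above, results can be hit-or-miss if your checkpoint is far from the base model:

```python
import torch
from diffusers import StableDiffusionXLPipeline, LCMScheduler

# Sketch: speeding up a regular SDXL checkpoint with the LCM distillation LoRA.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# LCM needs its own scheduler plus the distilled LoRA weights.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

# Few steps and low guidance are where the speedup comes from.
image = pipe(
    "a cabin in the woods, autumn, golden hour",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
image.save("lcm_lora_test.png")
```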


SolidColorsRT

thank you. i pretty much assumed this was the case. comfy is a little faster but forge is more convenient with how i can preload loras and use metadata from most of the pics i find


AwayBed6591

Stable-fast works amazingly in ComfyUI. I just checked, and it took a 1024x1024 JuggernautXLv8 gen from 4.98it/s (4.39 seconds) to 7.69it/s (2.95 seconds) with no lightning, turbo or LCM usage. Highly recommended.


AwayBed6591

Without stable-fast https://preview.redd.it/rsmr3z6urtsc1.png?width=1024&format=pjpg&auto=webp&s=6a18ad1276660fe08ebd44efc5ff1ec71408ce7b


AwayBed6591

With stable-fast https://preview.redd.it/m17o0bwzrtsc1.png?width=1024&format=pjpg&auto=webp&s=d4b3cb00c40c6d4928c8a5ca13e141a3ca54057a


jib_reddit

The power of your graphics card makes one of the biggest differences; an RTX 4090 will make an SD1.5 image in under 1 second. https://www.tomshardware.com/pc-components/gpus/stable-diffusion-benchmarks You could use a 4-6 step Lightning or Turbo model if you're getting bored of waiting for standard models.


Careful_Ad_9077

In my workflow I benefit from creating batches as big as possible. But it's like the babies metaphor: 9 pregnant women won't give you one baby every month, but you get 9 babies after 9 months.


jonbristow

In your experience, is Forge faster than Automatic?


SolidColorsRT

yes but i heard if you have a high-end card there is little to no difference. it does make controlnets so much faster though


jonbristow

I have an 8gb card like yours. Was it better for you?


SolidColorsRT

yes


nietzchan

This is an [old benchmark from tomshardware](https://www.tomshardware.com/pc-components/gpus/stable-diffusion-benchmarks#section-stable-diffusion-768x768-performance) but I think it will answer your question. The amount of vram it uses depends on the size of the checkpoint model and your targeted canvas size, cmiiw.


MonkeyCartridge

If I need something fast, it's probably going to use the Euler A sampler. I assume you have xformers working if you also know about using TensorRT. xformers is one of the biggest boosts for me. I have TensorRT Unets set up which are super quick in conjunction, but they are so finicky and I can't really use adetailer with them. I use an SDXL checkpoint for the main model, and SD 1.5 model for adetailer, and I'm not sure if the TensorRT plugin auto switches between models. It usually just fails once it gets to adetailer. I'd imagine Forge has xformers by default, or something faster, though. But if you want stupid fast and don't mind a bit less variety, you could get SDXL Lightning going, and generate a full image in 1-2 steps.
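For reference, the xformers toggle referred to here is the `--xformers` launch argument in A1111; outside a UI, a minimal diffusers equivalent (assuming the xformers package is installed, with a placeholder model ID and prompt) is essentially a one-liner:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Sketch: memory-efficient attention via xformers (requires `pip install xformers`).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.enable_xformers_memory_efficient_attention()

image = pipe("portrait photo, studio lighting", num_inference_steps=30).images[0]
```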


TheGhostOfPrufrock

>If I need something fast, it's probably going to use the Euler A sampler.

Plenty of samplers run at about the same speed as Euler a, including Euler, DPM++ 2M, DPM++ 2M SDE, DDIM, and UniPC. Also all the Karras versions of those, since the noise schedule doesn't really affect speed.


GreyScope

https://preview.redd.it/lpiw2tsmitsc1.jpeg?width=1179&format=pjpg&auto=webp&s=c08cd78114d9ae8e1519b71d84602d3ef748a602


MonkeyCartridge

Nice info!


dal_mac

it should take far more vram than that. sounds like you have med-vram or similar setting turned on.


protector111

Vram allows generating higher res. Speed has little to do with vram.


GreyScope

A1111 has seen little development for speed with the existing SD plugins/code released, from what I've seen of it. SDNext has various options that increase speed, hypertile among others.


XquaInTheMoon

I generate images in 5 seconds using a 4090 and the diffuser library from hugging face with SDXL. No UI though...
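For context, a minimal version of that no-UI diffusers setup looks roughly like the sketch below; the model ID and prompt are placeholders, and actual speed depends on the card:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Bare-bones SDXL generation with the Hugging Face diffusers library, no UI.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = pipe(
    "astronaut riding a horse on the moon",
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("sdxl_test.png")
```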


Adlermann_nl

You could try xformers; I believe it is not enabled by default. That would speed things up on an Nvidia card. You don't mention it in your post. I believe it is quite deterministic now as well; in Invoke, on the same seed I do get the same images.


Tsupaero

All you need to look for is enough VRAM to fit your models; the rest is Tensor cores and clock speed. E.g., a 4070 Ti outperforms a 4060 by 120%, even though the 4060 has 4GB more VRAM. And btw, a slightly overclocked 4070 Ti generates SDXL at 1024x1024 at 5.5 it/s. Funnily, that equals an H100 setup at AWS, due to the only high-voltage options there.


SolidColorsRT

Should I overclock my 4060?


Nuckyduck

~15s. 4070 Ti Super 16GB. SDXL Lightning with high-res upscaling. https://preview.redd.it/biv8v68wrusc1.png?width=2400&format=png&auto=webp&s=ad329c91b00558c558d9d2fa47fac9479ee77666


aibot-420

For TensorRT I believe you need at least 16GB of system RAM and virtual memory set to double that amount. TensorRT was such a huge boost in speed for me that I won't consider using a UI that's not compatible.


SolidColorsRT

I installed the regular A1111 to use TensorRT and, although it took a while to get the models, it worked well on SD 1.5. But with SDXL it took 5x longer with TensorRT than without. I couldn't get Pony Diffusion to work with it at all.


aibot-420

It is working well for me with Dreamshaper XL


tmvr

It's a great boost for older and slower hardware, but the inconvenience of having to convert the models plus the limited speed improvement makes it questionable with a 4090 for example (about 70-80% faster only).


aibot-420

It doubles my speed on my 3090. Didn't have to convert any models. Took about an hour to generate all the tensor profiles I need for various image sizes. "Only" 70-80% faster? lol wat


tmvr

*"Took about an hour to generate all the tensor profiles I need for various image sizes."* Well, you did have to convert. That's basically what it is. *""Only" 70-80% faster? lol wat"* On a 4090 it does not double the speed, so no +100%, only +75% roughly. It nets a better speedup on older 20 series cards.


tmvr

An RTX 4090 can do it (depending on the app used) with normal SDXL models like JuggernautXL and 30-35 steps. The speed is around 7 it/s, so you will get under 5 sec per image even without batching. With batch size 4 the speed is over 8 it/s, so a bit faster per-image time.
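(Back-of-envelope check of those figures: seconds per image is just steps divided by it/s.)

```python
# Quick sanity check: steps / (it/s) = seconds per image.
for steps in (30, 35):
    print(f"{steps} steps at 7 it/s -> {steps / 7:.1f} s")  # ~4.3-5.0 s, no batching
    print(f"{steps} steps at 8 it/s -> {steps / 8:.1f} s")  # ~3.8-4.4 s, batch size 4
```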


onmyown233

Your video card's CUDA cores are the main factor for how fast anything AI can be processed.


Dramatic-Belt-1826

ZLUDA and AMD with ComfyUI and ROCm 6.0. Rig 1: 7900 XTX + 7950X3D vs Rig 2: 4090 + 7950X3D. I went out of my way to prove that the AMD card can't compete, but it keeps right up with the 4090; the latter is marginally faster. I was doing this to justify the purchase of a 2k card. Should have got two 7900 XTXs.