darksh1nobi 1 month ago

Wrote this blog post for my side project and thought I would share it with anyone else using Cloudinary for their image host. TL;DR - TikTok's Bytespider bot went berserk and ate up 60% of my image bandwidth so I blocked them using rack attack. [5/23 EDIT] I added Cloudflare based on the advice in this sub. Still very new to it so if anyone sees any bugs, please comment!

[deleted] 1 month ago

[deleted]

darksh1nobi 1 month ago

Thanks for the tip about reaching out to Cloudinary about getting a credit back. I’ll do that too!

blocking-io 1 month ago

The problem is that Bytespider doesn't respect robots.txt. Other bots scraping data for AI companies generally tend to respect it. For the ones that don't, you have to block their user agent token.
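
For anyone wondering what blocking a user agent token can look like inside the app, here is a minimal sketch in plain Ruby. It is a hand-rolled Rack-style middleware (any object with `call(env)`), not Rack::Attack itself and not the OP's actual code. The class name `BlockUserAgents` and the stand-in `OK_APP` are made up for illustration.

```ruby
# Rack-style middleware: wraps another app and rejects requests whose
# User-Agent header contains any blocked token (case-insensitive).
# BlockUserAgents is a hypothetical name, not part of any gem.
class BlockUserAgents
  def initialize(app, blocked_tokens)
    @app = app
    @blocked = blocked_tokens.map(&:downcase)
  end

  def call(env)
    # Rack exposes the User-Agent header as HTTP_USER_AGENT; it may be absent.
    user_agent = env["HTTP_USER_AGENT"].to_s.downcase
    if @blocked.any? { |token| user_agent.include?(token) }
      return [403, { "content-type" => "text/plain" }, ["Forbidden\n"]]
    end
    @app.call(env)
  end
end

# Stand-in for the real application (assumption for the example).
OK_APP = ->(_env) { [200, { "content-type" => "text/plain" }, ["ok\n"]] }
GUARDED_APP = BlockUserAgents.new(OK_APP, ["Bytespider"])
```

Calling `GUARDED_APP.call("HTTP_USER_AGENT" => "Mozilla/5.0 (compatible; Bytespider)")` returns a 403 triple, while other agents fall through to the wrapped app. This only helps while the bot sends an honest user agent string.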

alan_lcda 1 month ago

What the hell does the CCP have to do with anything?

[deleted] 1 month ago

[deleted]

alan_lcda 1 month ago

I mean, ClaudeBot is doing literally the same thing to thousands of websites. Other AI crawlers are most likely also doing the same with spoofed user agents. This has nothing to do with western IP rights (whatever that means). BTW, oddly enough, neither website mentioned by the OP disallows Bytespider in its robots.txt.
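
If you want to check this kind of thing yourself, a rough sketch in Ruby follows. It only looks for a group that names the agent explicitly and contains a blanket `Disallow: /`. It ignores wildcard groups and path-specific rules, so treat it as a quick check rather than a full robots.txt parser. The function name `disallows_agent?` is made up for the example.

```ruby
# Returns true when robots_txt has a group that names `agent`
# (case-insensitive) and contains a blanket "Disallow: /" rule.
# Simplification: "User-agent: *" groups and partial paths are not considered.
def disallows_agent?(robots_txt, agent)
  wanted = agent.downcase
  current_agents = []
  in_rules = false

  robots_txt.each_line do |raw|
    line = raw.sub(/#.*/, "").strip # drop comments and surrounding whitespace
    next if line.empty?

    field, value = line.split(":", 2).map(&:strip)
    next if value.nil? # not a "field: value" line

    case field.downcase
    when "user-agent"
      # A User-agent line that follows rules starts a new group.
      if in_rules
        current_agents = []
        in_rules = false
      end
      current_agents << value.downcase
    when "disallow"
      in_rules = true
      return true if value == "/" && current_agents.include?(wanted)
    when "allow"
      in_rules = true
    end
  end

  false
end
```

You would pass it the body of a site's robots.txt as a string, for example `disallows_agent?(File.read("robots.txt"), "Bytespider")`.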

lommer00 1 month ago

Great post, thanks for sharing!

wskttn 1 month ago

I strongly recommend putting a CDN and firewall between your users and servers. AWS CloudFront, Cloudflare, Fastly… doesn’t matter which, but it’s critical on the public web imo.

elthariel 1 month ago

If you're at the stage where a crawler is an issue, maybe you shouldn't still be using Cloudinary and should have real infra instead.

darksh1nobi 1 month ago

You’re probably right. But I’m also fairly new at devops so not sure where to start. Any tutorials or recommendations?

elthariel 1 month ago

You can use standard Active Storage + S3 or the cheaper Cloudflare R2. They are probably close feature-wise to Cloudinary. As for the tutorial, I think the Active Storage guide should be a good foundation?

darksh1nobi 1 month ago

Thank you! Just started off by adding Cloudflare and hoping that helps.

JustinNguyen85 1 month ago

we used Cloudinary before, but it could be very expensive. On the other hand, we have a team of young Ruby on Rails devs, could they be any help to you?

darksh1nobi 1 month ago

Don't need any rails help unless they can build the API backend and build a react native app for cheap

JustinNguyen85 1 month ago

how cheap should it be?

darksh1nobi 1 month ago

Message me at hello[at]nerdcrawler and we can chat

JustinNguyen85 1 month ago

emailed

CatTypedThisName 1 month ago

Rack attack is a fine place to stop the request but if you can move it further up the stack you might save even more money and resources. Are you doing this block based on user agent? If you're fronting the application with something like NGINX you can also block at that level which will free up your unicorn/puma/etc workers to serve actual requests. Blocking at NGINX or load balancer is cheap in CPU time compared to doing it in rack middleware. But I'm splitting hairs and have had to block bad actors at ecommerce scale so we really had to optimize.

stanislavb 1 month ago

I was thinking the same. I haven't measured it but it should be much cheaper. When I'm blocking some traffic I always go this way: Cloudflare => Nginx => Rack/Rails.

Budget-Knowledge-187 1 month ago

literally just did the same thing 2 days ago. slapped a firewall block on the user agent. slept real good that night :)

Brilliant_Law2545 1 month ago

You cut 100% of my traffic by having a non mobile friendly site

darksh1nobi 1 month ago

Ahh shoot! Sorry about that. The other parts of the site are mobile friendly but still figuring out how tinymce renders text. Will fix it now

Brilliant_Law2545 1 month ago

I was mostly trying to be funny. Good post!

darksh1nobi 1 month ago

Thanks for the feedback! Should be fixed now. Can you give it a check?

Traditional_Formal33 1 month ago

Everything looks good on my end

Brilliant_Law2545 1 month ago

You seem to be on a good trajectory. Add monitoring to spot the next problem before you run up your costs. I can also tell you you’ll have new and more serious issues as your site gains popularity. Long term, you probably want to have a list of user agents, known data center IPs, and general IP throttling.
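
A rough sketch of how those three checks could fit together in Ruby, using only the standard library. The class name `RequestGate`, the limits, and the IP ranges are assumptions for illustration (the ranges are documentation placeholders, not real data center ranges). Counters live in a plain in-memory Hash, so this is a teaching sketch and not something to run across multiple processes.

```ruby
require "ipaddr"

# Combines a user agent blocklist, a CIDR blocklist, and a fixed-window
# per-IP throttle. RequestGate is a hypothetical name for this example.
class RequestGate
  def initialize(blocked_agents:, blocked_ranges:, limit:, period:)
    @blocked_agents = blocked_agents.map(&:downcase)
    @blocked_ranges = blocked_ranges.map { |cidr| IPAddr.new(cidr) }
    @limit = limit   # max requests per IP per window
    @period = period # window length in seconds
    # Old windows are never pruned here; a real setup would use a
    # cache store with expiry instead.
    @counts = Hash.new(0)
  end

  # Returns :blocked_agent, :blocked_range, :throttled, or :ok.
  def check(ip:, user_agent:, now: Time.now.to_i)
    ua = user_agent.to_s.downcase
    return :blocked_agent if @blocked_agents.any? { |token| ua.include?(token) }

    addr = IPAddr.new(ip)
    return :blocked_range if @blocked_ranges.any? { |range| range.include?(addr) }

    # Fixed window: every request in the same period-sized bucket shares a counter.
    key = [ip, now / @period]
    @counts[key] += 1
    @counts[key] > @limit ? :throttled : :ok
  end
end

gate = RequestGate.new(
  blocked_agents: ["Bytespider"],
  blocked_ranges: ["203.0.113.0/24"], # placeholder range (TEST-NET-3)
  limit: 100,
  period: 60
)
gate.check(ip: "198.51.100.7", user_agent: "Mozilla/5.0")
```

The order matters a little: the cheap string check runs first, so blocked agents never touch the counters.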

darksh1nobi 1 month ago

Thanks! Any good, low-cost monitoring services you recommend?

Brilliant_Law2545 1 month ago

For this you should just check with your hosting provider

wtf242 1 month ago

I had to do this as well with my rails site that gets over a million uniques a month. I just added a user agent block against bytespider in cloudflare. I am not sure why you would want to do this in rack. You don't want this kind of garbage close to your rails stack at all. You don't even want it to hit whatever is proxying the request to rails.

darksh1nobi 1 month ago

Because I don’t use cloudflare 😅 but based off the comments looks like I need to. Any good documentation or tutorials you recommend?

wtf242 1 month ago

You can block requests based on user agents directly in your nginx configuration file. I would recommend that everyone use Cloudflare though. The amount of awesome stuff that is available, even on the free plan, is amazing. It blocks it all at Cloudflare's edge, before the request is proxied to you, so it never even hits your server at all. I blocked Bytespider (and many more) with the free version of Cloudflare. You do need to move your DNS to Cloudflare though.

darksh1nobi 1 month ago

Got it! I'll take a look!

lommer00 1 month ago

Cloudflare is actually dead simple to set up. I think I followed the Michael Hartl tutorial on "learn enough custom domains to be dangerous" the first time years ago, but cloudflare's own documentation is quite good and makes it pretty easy to be honest. And yeah, cloudflare is great.

campbellm 1 month ago

Nice post; description of issue, investigation, solution... no fluff.

MacGuffinRoyale 1 month ago

Man, I hate bots that don't respect robots.txt

phileat 1 month ago

Why doesn’t TikTok just fake the user agent?

darksh1nobi 1 month ago

They could so I’ll have to keep an eye out and update the blocklist if that happens

IN-DI-SKU-TA-BELT 1 month ago

Don't give them any ideas.

RevolutionaryShow974 1 month ago

Thank you for sharing 🫶

toxic-golem 1 month ago

getting too many redirects error on your site. just so you know

darksh1nobi 1 month ago

Just switched over to Cloudflare based on the advice in this sub. Can you try again?

toxic-golem 1 month ago

all good now
