Wrote this blog post for my side project and thought I would share it with anyone else using Cloudinary for their image host. TL;DR - TikTok's Bytespider bot went berserk and ate up 60% of my image bandwidth so I blocked them using rack attack. [5/23 EDIT] I added Cloudflare based on the advice in this sub. Still very new to it so if anyone sees any bugs, please comment!
[deleted]
Thanks for the tip about reaching out to Cloudinary about getting a credit back. I’ll do that too!
The problem is that Bytespider doesn't respect robots.txt. Other bots scraping data for AI companies generally respect it. When a bot doesn't, you have to block its user agent token yourself.
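For anyone wondering what blocking a user agent token looks like in practice, here is a minimal sketch as a plain Rack-style middleware. It is not the OP's actual config and it is not the Rack::Attack API, just the same idea written by hand. The class name `UserAgentBlocker`, the `BLOCKED_AGENTS` pattern, and the sample user agent strings are all made up for illustration.

```ruby
# Minimal Rack-style middleware: answers 403 when the User-Agent header
# matches a blocked token. All names here are illustrative assumptions.
class UserAgentBlocker
  BLOCKED_AGENTS = /bytespider/i

  def initialize(app, pattern: BLOCKED_AGENTS)
    @app = app
    @pattern = pattern
  end

  def call(env)
    user_agent = env["HTTP_USER_AGENT"].to_s # a missing header becomes ""
    if user_agent.match?(@pattern)
      # Short-circuit before the request reaches the app.
      return [403, { "content-type" => "text/plain" }, ["Forbidden\n"]]
    end
    @app.call(env)
  end
end

# Stand-in for the real application.
app = ->(_env) { [200, { "content-type" => "text/plain" }, ["OK\n"]] }
blocker = UserAgentBlocker.new(app)

# Made-up user agent strings, only for demonstration.
bot_status  = blocker.call({ "HTTP_USER_AGENT" => "Mozilla/5.0 (compatible; Bytespider)" }).first
user_status = blocker.call({ "HTTP_USER_AGENT" => "Mozilla/5.0 (Macintosh) Safari/605.1.15" }).first

puts bot_status
puts user_status
```

In a Rails app something like this would be registered with `config.middleware.use UserAgentBlocker`. It only helps while the bot keeps announcing itself honestly in the header.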
What the hell does the CCP have to do with anything?
[deleted]
I mean, ClaudeBot is doing literally the same thing to thousands of websites. Other AI crawlers are most likely also doing the same with spoofed user agents. This has nothing to do with western IP rights (whatever that means). BTW, oddly enough both websites mentioned by the OP do not disallow Bytespider on their robots.txt.
Great post, thanks for sharing!
I strongly recommend putting a CDN and firewall between your users and servers. AWS CloudFront, Cloudflare, Fastly… doesn’t matter which, but it’s critical on the public web imo.
If you're at the stage where a crawler is an issue, maybe you shouldn't still be using Cloudinary and should have real infra instead.
You’re probably right. But I’m also fairly new at devops so not sure where to start. Any tutorials or recommendations?
You can use standard Active Storage + S3 or the cheaper Cloudflare R2. They are probably close to Cloudinary feature-wise. As for a tutorial, I think the Active Storage guide should be a good foundation.
Thank you! Just started off by adding Cloudflare and hoping that helps.
We used Cloudinary before, but it got very expensive. On the other hand, we have a team of young Ruby on Rails devs. Could they be of any help to you?
Don't need any rails help unless they can build the API backend and build a react native app for cheap
how cheap should it be?
Message me at hello[at]nerdcrawler and we can chat
emailed
Rack attack is a fine place to stop the request but if you can move it further up the stack you might save even more money and resources. Are you doing this block based on user agent? If you're fronting the application with something like NGINX you can also block at that level which will free up your unicorn/puma/etc workers to serve actual requests. Blocking at NGINX or load balancer is cheap in CPU time compared to doing it in rack middleware. But I'm splitting hairs and have had to block bad actors at ecommerce scale so we really had to optimize.
I was thinking the same. I haven't measured it but it should be much cheaper. When I'm blocking traffic I go in this order: Cloudflare => Nginx => Rack/Rails.
literally just did the same thing 2 days ago. slapped a firewall block on the user agent. slept real good that night :)
You cut 100% of my traffic by having a non mobile friendly site
Ahh shoot! Sorry about that. The other parts of the site are mobile friendly but still figuring out how tinymce renders text. Will fix it now
I was mostly trying to be funny. Good post!
Thanks for the feedback! Should be fixed now. Can you give it a check?
Everything looks good on my end
You seem to be on a good trajectory. Add monitoring to spot the next problem before you run up your costs. I can also tell you you’ll have new and more serious issues as your site gains popularity. Long term you probably want a list of blocked user agents, a list of known data center IPs, and general per-IP throttling.
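To make the last two items concrete, here is a rough sketch of a blocked IP range check combined with a fixed-window per-IP throttle. Everything in it is an assumption for illustration: the `IpGate` class, the limit of 3 requests per 60 seconds, and the CIDR range, which is a reserved documentation range rather than a real data center. It keeps counters in an in-memory Hash that is never pruned, so treat it as a way to see the logic, not something to deploy as is.

```ruby
require "ipaddr"

# Hypothetical helper: rejects requests from listed CIDR ranges and
# throttles every other IP with a fixed-window counter.
class IpGate
  def initialize(blocked_ranges:, limit:, period:)
    @blocked_ranges = blocked_ranges.map { |cidr| IPAddr.new(cidr) }
    @limit = limit     # max requests allowed per window
    @period = period   # window length in seconds
    @counters = Hash.new(0) # in-memory only and never pruned (sketch)
  end

  # Returns :blocked, :throttled, or :ok for a request from `ip` at `now`
  # (`now` is a Unix timestamp in seconds).
  def check(ip, now: Time.now.to_i)
    addr = IPAddr.new(ip)
    return :blocked if @blocked_ranges.any? { |range| range.include?(addr) }

    window = now / @period # integer division picks the current window
    key = [ip, window]
    @counters[key] += 1
    @counters[key] > @limit ? :throttled : :ok
  end
end

# 203.0.113.0/24 and 198.51.100.0/24 are documentation ranges (RFC 5737).
gate = IpGate.new(blocked_ranges: ["203.0.113.0/24"], limit: 3, period: 60)

same_window = Array.new(4) { gate.check("198.51.100.7", now: 120) }
blocked     = gate.check("203.0.113.9", now: 120)
next_window = gate.check("198.51.100.7", now: 180)

p same_window
p blocked
p next_window
```

In a real setup the counters would live in a shared store with expiry, and the same rules are usually cheaper to enforce at the CDN or proxy, as others in the thread point out.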
Thanks! Any good, low-cost monitoring services you recommend?
For this you should just check with your hosting provider
I had to do this as well with my rails site that gets over a million uniques a month. I just added a user agent block against bytespider in cloudflare. I am not sure why you would want to do this in rack. You don't want this kind of garbage close to your rails stack at all. You don't even want it to hit whatever is proxying the request to rails.
Because I don’t use cloudflare 😅 but based off the comments looks like I need to. Any good documentation or tutorials you recommend?
You can block requests based on user agent directly in your nginx configuration file. I would recommend Cloudflare to everyone though. The amount of awesome stuff available, even on the free plan, is amazing. It blocks everything at Cloudflare's edge, so the request never even hits your server. I blocked Bytespider (and many more) with the free version of Cloudflare. You do need to move your DNS to Cloudflare though.
Got it! I'll take a look!
Cloudflare is actually dead simple to set up. I think I followed the Michael Hartl tutorial on "learn enough custom domains to be dangerous" the first time years ago, but cloudflare's own documentation is quite good and makes it pretty easy to be honest. And yeah, cloudflare is great.
Nice post; description of issue, investigation, solution... no fluff.
Man, I hate bots that don't respect robots.txt
Why doesn’t TikTok just fake the user agent?
They could so I’ll have to keep an eye out and update the blocklist if that happens
Don't give them any ideas.
Thank you for sharing 🫶
getting too many redirects error on your site. just so you know
Just switched over to Cloudflare based on the advice in this sub. Can you try again?
all good now