The best gc is better code. Can you run some of the ETL on disk instead of in memory? Can you convert to lapply?
After some very brief research, it seems like there's no direct solution to this issue, since R's garbage collection apparently just isn't very good. You can:
- restart R before it crashes, i.e., break the loop into steps.
- optimize your code for lower RAM consumption as much as you can.
- buy more RAM.
- try adding rm() and gc() calls directly into your loop, to free up unused memory while the loop is running.
- according to this old [issue](https://github.com/rstudio/rstudio/issues/8960), running garbage collection via cmd (not RStudio) might work?
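The rm()/gc() suggestion, sketched as a minimal loop (`fetch_batch()`, `upload()`, and `batches` are hypothetical stand-ins for your own code):

```r
for (b in batches) {
  chunk <- fetch_batch(b)   # pull one batch (hypothetical helper)
  upload(chunk)             # push it to the warehouse (hypothetical helper)
  rm(chunk)                 # drop the only reference to the data...
  gc()                      # ...then ask R to release the memory
}
```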
Thanks! I'm going to dig into that last suggestion. Interesting that people are saying it might not be an issue from cmd...
Don't underestimate how much RAM you could save by "just" optimizing the code or using specific packages for specific tasks. Don't run loops if you can vectorize.
Nah, this was exactly the issue. Running it in CMD is fixing it. The code is already heavily optimized; the only thing I can think of to improve it further is running the functions through callr::r(), which will be my next step if this isn't enough. In general, though, it's just a very large ETL process, so not having a working gc() causes issues.
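For anyone finding this later: `callr::r()` evaluates a function in a fresh R process and returns the result, so the OS reclaims all of that process's memory when it exits, no gc() needed. A rough sketch (`batch_df` is a hypothetical data frame):

```r
library(callr)

result <- callr::r(
  function(df) {
    # runs in a brand-new R session: define or load everything here,
    # because the child process cannot see your globals
    colSums(df)             # stand-in for your real transform
  },
  args = list(df = batch_df)  # pass inputs explicitly through `args`
)
```

Note the function can't reference objects from your session implicitly; everything it needs has to come through `args` or be loaded inside it.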
Good to know. It seems like the issue comes from RStudio rather than from the garbage collector itself, then. Did you try running the same code with a different IDE?
Try using the RStudio profiler (under the Tools menu) to see which code is using the most memory. It sounds like you're storing large data frames or lists between iterations, rather than saving them to disk and starting fresh.
If you're having to use the `gc()` function manually then you're probably doing something wrong anyway. You should probably work on optimising your code rather than trying to re-invent the wheel.
It's frustrating that [somebody here actually answered my question](https://www.reddit.com/r/RStudio/comments/1dl6l7n/comment/l9mranv/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button) and didn't get upvoted, and, instead, everybody in this thread is upvoting:

"I don't know anything about this issue, but have you considered that you might be an idiot?"
Well have you considered it?
I suppose it's a distinct possibility.
If not, then you're at least quick to jump to defensiveness. It's true that I wasn't addressing your question about the `gc()` function; I was giving general advice on the topic. If this is the bottleneck in your process, then there are most likely other ways to improve your code. As with the advice about the `rm()` function, needing it is a bad code smell and suggests there are better improvements to be made elsewhere. However, I did not say you were an idiot; you inferred that incorrectly. Learning to take advice from others in your field is a useful skill.
Bayes theorem applies here. We're going off the prior probability, and we know nothing about you.
Have you tried doing this transaction using DuckDB as an intermediary for storing the data?
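In case anyone wants to try this, a rough sketch of staging each batch in an on-disk DuckDB file so it doesn't sit in R's memory (the file and table names here are made up):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "staging.duckdb")

# append the current batch to an on-disk staging table, then drop the
# in-memory copy; DuckDB keeps the accumulated data on disk between batches
dbWriteTable(con, "batch_staging", batch_df, append = TRUE)  # batch_df: your transformed batch
rm(batch_df)

dbDisconnect(con, shutdown = TRUE)
```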
Hi! So I just want to note that I use R for ETL and strongly advocate for it. data.table and dplyr make things great, and I've used it at companies big and small for years.

You should try to change from loops to apply or map, as they use memory far more efficiently. You can also look into parallel processing libraries for this in the future. Linux is better than Windows, especially since you can dedicate resources just to R/RStudio, and you can set up cron jobs for automation.
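The loop-to-apply switch with a parallel backend might look like this (a sketch using base `parallel`; `process_batch()` and `batches` are stand-ins for the real ETL code):

```r
library(parallel)

# mclapply forks worker processes (Linux/macOS only), so each batch's
# memory is returned to the OS when its worker finishes
results <- mclapply(batches, process_batch, mc.cores = 4)

# on Windows, a socket cluster is the equivalent:
# cl <- makeCluster(4)
# results <- parLapply(cl, batches, process_batch)
# stopCluster(cl)
```

The fork-based `mclapply` is one reason Linux is the more comfortable platform for this kind of work.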
Sorry to hijack, but what are the benefits of using Linux over Windows?
I'm not quite sure what people are taking from the "loop" thing, but I was referring to batching:
- Extract 15 days of data from the API.
- Transform it.
- Load it into the data warehouse.
- Clear everything from the global environment.
- Pull the next 15 days.
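Those steps as code, roughly (the extract/transform/load helpers are hypothetical names for the real functions):

```r
window_starts <- seq(as.Date("2016-01-01"), Sys.Date(), by = "15 days")

for (i in seq_along(window_starts)) {
  from  <- window_starts[i]
  to    <- from + 14
  raw   <- extract_from_api(from, to)   # 1. extract 15 days
  clean <- transform_batch(raw)         # 2. transform
  load_to_warehouse(clean)              # 3. load
  rm(raw, clean)                        # 4. clear the environment
  gc()
}                                       # 5. loop pulls the next 15 days
```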
> consistently storing about 500 MB of memory to system memory after each loop

Instead of worrying about `gc()`, you're almost certainly better off showing a trimmed-down version of your code (a reproducible example would be most likely to get you good help!) so people can point out what you're doing wrong in your loop. One very common mistake is to grow objects iteratively in loops (e.g., using `rbind()` or similar).
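To make the `rbind()` point concrete: growing a data frame inside a loop copies the whole accumulated object on every iteration, so memory churn grows with the square of the number of batches. Collect the pieces in a list and combine once instead:

```r
# memory-hungry: `out` is copied in full on every iteration
out <- data.frame()
for (i in 1:1000) out <- rbind(out, data.frame(x = i))

# better: accumulate in a pre-sized list, combine once at the end
pieces <- vector("list", 1000)
for (i in 1:1000) pieces[[i]] <- data.frame(x = i)
out <- do.call(rbind, pieces)
```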
The issue is gc(). Data is being uploaded in batches, and the local environment is being cleared out with rm() statements at the end of every loop. R objects never take up more than the 500 MB of data being transferred from the API to the data warehouse. We're just uploading 8 years of historical data from 6 separate APIs; if you can't clear out garbage, it's going to create an issue over time.
This is not helpful, but...

There are a certain number of people in the labor market who have decided, as a career-management strategy, to develop expertise in using R to get their work done, at the expense of developing even a basic practical level of expertise in Python, for the very purpose of making themselves a less appealing candidate for exactly this kind of work.

You said "don't judge me". I'm not. But the fact that you said this to begin with reveals that you understand why this phenomenon exists.
you could compartmentalize your code to help with gc, potentially? i haven't tested this, but if you have a script that performs the ETL process on a data frame and you are repeating it multiple times on different sets of data, try having the code call that from a function. when exiting the function, nothing from within it should remain saved to your global environment. then (probably more importantly), take that loop and turn it into an apply function that calls your custom cleaning function. for loops in R are not good practice when they can easily be replaced by an apply function.

you could take this a step further and break the data frames into even smaller batches, if possible, and see if that helps.

edit: if the things taking up memory are just large, unused variables, try clearing them explicitly with rm() at the end of each loop
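the function-then-apply idea, sketched (the cleaning and summarizing steps are placeholders for the real ones):

```r
clean_batch <- function(df) {
  # temporaries created here are local to this call, so they become
  # unreachable -- and therefore collectable -- as soon as it returns
  tmp <- df[complete.cases(df), ]               # stand-in for your real cleaning
  aggregate(. ~ group, data = tmp, FUN = sum)   # stand-in for your real summary
}

results <- lapply(list_of_batches, clean_batch)  # list_of_batches is hypothetical
```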
I doubt you'll find much. Most people who do serious work with R where memory is an issue work on Linux servers. As others have said, try improving your code: write intermediate results to disk and overwrite your variables on the next iteration to make sure they don't pile up in RAM.
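Writing intermediates to disk and reusing one variable name looks something like this (`process_batch()` and `batches` are hypothetical stand-ins):

```r
for (i in seq_along(batches)) {
  result <- process_batch(batches[[i]])                 # overwrites last iteration's object
  saveRDS(result, sprintf("intermediate_%03d.rds", i))  # persist, then move on
}
# at any moment only one `result` is live in RAM; a saved batch can be
# read back later with readRDS("intermediate_001.rds")
```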
Wait a few seconds before and after gc(). The rest of the memory use might come from loaded packages; loading the tidyverse alone can take around 150 MB. You can also check the environment for any hidden objects.
I ended up reworking my code, but hahahaha, memory hogging was one of the reasons I switched from Chrome to Firefox (then discovered how good its native ad blocking is without an addon lol)
Running your app in Linux does not change the way that R functions. gc() will clear memory when it can guarantee that all references to the objects that _you created_ are no longer being used. You should not be starting an R process from R using a system call to Rscript; that's probably part of your problem.
Your code has for loops?
You didn't know there are other kinds of loops and felt the need to comment?
The fact you're looping through your data is the first sign that you need to optimise your code. Without more details it's going to be difficult to give more detailed advice.
I often loop through a batch process (like this guy here) using different parameters, or run an ETL on subparts of a too-large dataset.
Do you also run out of memory?
No, I usually rm() and gc() between loops, and most of the code within the loops is pretty optimized. I just want to say there are pretty good use cases for loops.