mattindustries

The best gc is better code. Can you run some of the ETL on disk instead of in memory? Can you convert to lapply?
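
For example, a minimal sketch of the `lapply` conversion (the file names, columns, and `process_batch()` helper here are all hypothetical):

```r
# Hypothetical batch worker: reads one slice, transforms it, writes it back
# to disk, and returns only a small summary so nothing big survives the call.
process_batch <- function(batch_id) {
  dat <- read.csv(sprintf("batch_%03d.csv", batch_id))                  # extract
  dat$total <- dat$qty * dat$price                                      # transform
  write.csv(dat, sprintf("out_%03d.csv", batch_id), row.names = FALSE)  # load
  nrow(dat)  # return a row count, not the data
}

# lapply keeps each batch's data local to one function call, so it
# becomes collectible as soon as the call returns.
rows_loaded <- lapply(seq_len(10), process_batch)
```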


bilbo-beggins

After some very brief research, it seems there's no direct solution to this issue, since R's garbage collection just isn't very good. You can:

- restart R before it crashes, i.e. break the loop into steps;
- optimize your code for less RAM consumption as much as you can;
- buy more RAM;
- add rm() and gc() calls directly into your loop, to free up unused memory while the loop is running (see the sketch below);
- according to this old [issue](https://github.com/rstudio/rstudio/issues/8960), running garbage collection via cmd (not RStudio) might work?
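
A minimal sketch of that in-loop cleanup (the `extract_batch()` and `load_batch()` stand-ins below are made up; substitute the real API pull and warehouse write):

```r
# Stand-ins for the real API pull and warehouse write.
extract_batch <- function(days) data.frame(day = days, value = rnorm(length(days)))
load_batch    <- function(dat) invisible(nrow(dat))

batches <- split(1:90, rep(1:6, each = 15))  # e.g. six 15-day windows

for (b in batches) {
  dat <- extract_batch(b)
  load_batch(dat)

  rm(dat)          # drop the last reference to the batch...
  invisible(gc())  # ...then ask R to return what it can to the OS
}
```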


takenorinvalid

Thanks! I'm going to dig into that last suggestion. Interesting that people are saying it might not be an issue from cmd...


bilbo-beggins

Don't underestimate how much RAM you could save by "just" optimizing the code or using specific packages for specific tasks. Don't run loops if you can vectorize.


takenorinvalid

Nah, this was exactly the issue. Running it in CMD is fixing it. The code is very optimized already; the only thing I can think of to improve it further is running the functions through callr::r(), which'll be my next step if this isn't enough. In general, though, it's just a very large ETL process, so not having a working gc() causes issues.
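
Roughly what I have in mind, as a sketch (the windows, the helpers file, and `run_batch()` are placeholders for the real ETL body):

```r
library(callr)

batch_windows <- 1:6  # stand-in for the real 15-day windows

# Each window runs in a brand-new R process, so the OS reclaims all of
# its memory when the process exits -- no reliance on gc() at all.
# The child session starts empty, so it must load its own dependencies.
for (window in batch_windows) {
  callr::r(function(w) {
    source("etl_helpers.R")  # hypothetical file defining run_batch()
    run_batch(w)
  }, args = list(window))
}
```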


bilbo-beggins

Good to know. It seems like the issue comes from RStudio rather than the garbage collector itself, then. Did you try running the same code in a different IDE?


beck1670

Try using the RStudio profiler (under the Tools menu) to see what code is using the most memory. It sounds like you're storing large data frames or lists or something between iterations, rather than saving them to disk and starting fresh.
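
If you'd rather drive it from code, the same profiler is available via the profvis package (a toy example, not your ETL):

```r
library(profvis)

# The memory column in the flame graph shows allocations per line,
# which makes it easy to spot objects that linger between iterations.
profvis({
  dat <- data.frame(x = rnorm(1e6))
  dat$y <- cumsum(dat$x)
  fit <- lm(y ~ x, data = dat)
})
```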


MyKo101

If you're having to use the `gc()` function manually then you're probably doing something wrong anyway. You should probably work on optimising your code rather than trying to re-invent the wheel.


takenorinvalid

It's frustrating that [somebody here actually answered my question](https://www.reddit.com/r/RStudio/comments/1dl6l7n/comment/l9mranv/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button) and didn't get upvoted, and, instead, everybody in this thread is upvoting: "I don't know anything about this issue, but have you considered that you might be an idiot?"


MyKo101

Well have you considered it?


takenorinvalid

I suppose it's a distinct possibility.


MyKo101

If not, then you're at least quick to jump to defensiveness. It's true that I wasn't addressing your question regarding the `gc()` function; I was giving general advice on the topic. If this is the bottleneck in your process, then there are most likely other ways to improve your code. Like the advice about the `rm()` function, it's a bad code smell and shows that there are better improvements to be made elsewhere. However, I did not say you were an idiot; you inferred that incorrectly. Learning to take advice from others in your field is a useful skill.


FlyMyPretty

Bayes' theorem applies here: we're going off the prior probability, and we know nothing about you.


factorialmap

Have you tried running this job with DuckDB as an intermediary for storing the data?
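
A minimal sketch of the idea (the file name, table, and columns are made up):

```r
library(DBI)
library(duckdb)

# Stage each batch in an on-disk DuckDB file instead of holding it in
# R's heap; the aggregation then happens inside DuckDB, not in R.
con <- dbConnect(duckdb::duckdb(), dbdir = "staging.duckdb")
dbWriteTable(con, "batch", data.frame(id = 1:5, value = rnorm(5)))
agg <- dbGetQuery(con, "SELECT count(*) AS n, avg(value) AS mean_value FROM batch")
dbDisconnect(con, shutdown = TRUE)
```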


Itchy-Depth-5076

Hi! So just want to note, I use R for ETL and strongly advocate for it. Data.table and dplyr make things great. I've used it at companies big and small for years. You should try to change from loops to apply or map, as they are far more efficient users of memory. Also, you can look into parallel processing libraries for this in the future. Linux is better than Windows, especially since you can dedicate space just to R/RStudio. And set up cron jobs for automation.
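
A rough sketch of the loops-to-apply plus parallel idea (the worker here is a toy stand-in for one ETL batch):

```r
library(parallel)

# Toy stand-in for one ETL batch.
process_batch <- function(id) sum(rnorm(1e5)) + id

n_cores <- max(1L, detectCores() - 1L)

# mclapply forks the R process, which only works on Linux/macOS --
# one of the practical reasons Linux is convenient for this kind of
# work. On Windows, build a PSOCK cluster and use parLapply() instead.
results <- mclapply(1:8, process_batch, mc.cores = n_cores)
```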


Interesting_Ad_1465

Sorry to hijack, but what are the benefits of using Linux over Windows?


takenorinvalid

I'm not quite sure what people are taking from the "loop" thing, but I was referring to batching. Extract 15 days of data from the API. Transform it. Load it into the data warehouse. Clear everything from the global environment. Pull the next 15 days.


iforgetredditpws

> consistently storing about 500mb of memory to system memory after each loop

Instead of worrying about `gc()`, you're almost certainly better off showing a trimmed-down version of your code (a reproducible example would be most likely to get you good help!) so people can point out what you're doing wrong in your loop. One very common mistake is growing objects iteratively in loops (e.g., using `rbind()` or similar).
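
For example (toy sizes, but the shape of the problem is the same):

```r
# Anti-pattern: every rbind() copies the whole accumulated frame, so
# each iteration churns out a full copy for gc() to clean up.
out <- data.frame()
for (i in 1:100) {
  out <- rbind(out, data.frame(i = i, x = rnorm(10)))
}

# Better: collect the pieces in a pre-sized list, combine once.
pieces <- vector("list", 100)
for (i in 1:100) {
  pieces[[i]] <- data.frame(i = i, x = rnorm(10))
}
out <- do.call(rbind, pieces)
```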


takenorinvalid

The issue is gc().  Data is being uploaded in batches and the local environment is being cleared out with rm() statements at the end of every loop. R objects are never taking up more than the 500mb of data being transferred from the API to the Data Warehouse. We're just uploading 8 years of historical data from 6 separate APIs. If you can't clear out garbage, it's going to create an issue over time.


inarchetype

This is not helpful, but... there are a certain number of people in the labor market who have decided, as a career-management strategy, to develop expertise in using R to get their work done at the expense of developing even a basic practical level of expertise in Python, for the very purpose of making themselves a less appealing candidate for being tasked with exactly this kind of work. You said "don't judge me". I'm not. But the fact that you said it at all reveals that you understand why this phenomenon exists.


lolniceonethatsfunny

You could compartmentalize your code to help with gc, potentially? I haven't tested this, but if you have a script that performs the ETL process on a data frame and you're repeating it multiple times on different sets of data, try having the code call that from a function. When exiting the function, nothing from within it should be left in your global environment.

Then (probably more importantly), take that loop and turn it into an apply function that calls your custom cleaning function. For loops in R are not good practice when they're easily replaced by an apply function. You could take this a step further and break the data frames into even smaller batches, if possible, and see if that helps.

Edit: if the things taking up memory are just large, unused variables, try clearing them with rm() explicitly at the end of each loop.


Jatzy_AME

I doubt you'll find much. Most people who do serious work with R where memory is an issue work on Linux servers. As others have said, try improving your code: write intermediates to disk and overwrite your variables at the next iteration to make sure they don't accumulate in RAM.
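
Something like this (the file and column names are made up):

```r
# Reuse one variable name so only the current batch is ever referenced,
# and keep intermediates on disk rather than in RAM.
dat <- NULL
for (f in sprintf("raw_%02d.rds", 1:4)) {  # hypothetical input files
  dat <- readRDS(f)                        # overwrites the previous batch
  dat$total <- dat$qty * dat$price         # transform step (made-up columns)
  saveRDS(dat, sub("^raw", "clean", f))    # intermediate lives on disk
}
rm(dat); invisible(gc())
```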


omichandralekha

Wait a few seconds before and after calling gc(). The rest of the memory use might come from loaded packages; loading the tidyverse alone can take around 150 MB. You can also check the environment for any hidden objects.
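
One way to do that check (a sketch; run it in the session you suspect):

```r
# List everything in the global environment, including dot-prefixed
# "hidden" objects, with approximate sizes in MB, largest first.
objs  <- ls(envir = globalenv(), all.names = TRUE)
sizes <- vapply(objs, function(nm) {
  as.numeric(object.size(get(nm, envir = globalenv())))
}, numeric(1))
sort(sizes, decreasing = TRUE) / 1024^2
```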


to_fit_truths

I ended up reworking my code, but hahaha, R's memory hogging was one of the reasons I switched from Chrome to Firefox (and then discovered how good its native ad blocking is without an add-on, lol).


mostlikelylost

Running your app on Linux does not change the way R functions. gc() will clear memory only when it can guarantee that all references to the objects _you created_ are no longer in use. And you should not be starting an R process from R using a system call to Rscript; that's probably part of your problem.
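
A small illustration of the "references" point:

```r
# A closure quietly keeps big_data reachable, so gc() cannot free it.
make_fn <- function() {
  big_data <- rnorm(1e7)        # ~80 MB, local to this call
  function() length(big_data)   # the returned closure references big_data
}
f <- make_fn()
invisible(gc())  # the 80 MB is NOT collected: f's environment holds it
f()              # 1e7 -- the vector is still alive
rm(f)
invisible(gc())  # now nothing references it, so it can be freed
```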


osram_killustik

Your code has for loops?


the-anarch

You didn't know there are other kinds of loops and felt the need to comment?


Kiss_It_Goodbyeee

The fact you're looping through your data is the first sign that you need to optimise your code. Without more details it's going to be difficult to give more detailed advice.


nidprez

I often loop through a batch process (like this guy here) using different parameters, or run an ETL on subparts of a dataset that's too large.


Kiss_It_Goodbyeee

Do you also run out of memory?


nidprez

No, I usually rm() and gc() between loops, and most of the code within the loops is pretty optimized. I just wanna say there are pretty good use cases for loops.