Wednesday, November 17, 2021

Python's Garbage

Python's GC appears to be hugely different from the GC in Java that we know and love. I have some PySpark code that collects data from a data set back to the driver and then processes it in Pandas. The first few collects work nicely, but ultimately PySpark barfs with memory issues, and no amount of increasing the driver memory seems to solve the problem. This is odd, because each collect retrieves only a modest amount of data, and once it has been processed I'm no longer interested in it, so it should be reclaimable. The explanation appears to be this:

"Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, then the memory is available to new Python objects, but not free()'d back to the system.

"As noted in the comments, there are some things to try: gc.collect (@EdChum) may clear stuff, for example. At least from my experience, these things sometimes work and often don't. There is one thing that always works, however, because it is done at the OS, not language, level.

"Then if you do something like

import multiprocessing
result = multiprocessing.Pool(1).map(huge_intermediate_calc, [something_])[0]

"Then the function is executed at a different process. When that process completes, the OS retakes all the resources it used. There's really nothing Python, pandas, the garbage collector, could do to stop that." [StackOverflow]
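For what it's worth, here is a minimal, self-contained sketch of that idea. It is a variation on the quoted snippet rather than the answer's own code: it uses a single Process and a Pipe instead of Pool(1).map, and it assumes a Unix-like OS where the "fork" start method is available. The function huge_intermediate_calc below is a hypothetical stand-in (the real one is not shown in the quote), as are the helper names run_in_child and _child_main.

```python
import multiprocessing


def huge_intermediate_calc(n):
    """Hypothetical stand-in: build a large temporary list, return a small result."""
    intermediate = [i * i for i in range(n)]
    return sum(intermediate)


def _child_main(conn, func, arg):
    # Runs in the child process: compute, send back only the small result, exit.
    # If func raises, nothing is sent and the parent's recv() raises EOFError.
    try:
        conn.send(func(arg))
    finally:
        conn.close()


def run_in_child(func, arg):
    """Run func(arg) in a short-lived child process and return its result."""
    # Assumption: a Unix-like OS where the "fork" start method is available.
    ctx = multiprocessing.get_context("fork")
    receiver, sender = ctx.Pipe(duplex=False)
    child = ctx.Process(target=_child_main, args=(sender, func, arg))
    child.start()
    sender.close()  # the parent keeps only the receiving end
    try:
        # Read before join() so a large result cannot block the child.
        result = receiver.recv()
    finally:
        receiver.close()
        child.join()
    return result


if __name__ == "__main__":
    print(run_in_child(huge_intermediate_calc, 1_000_000))
```

The big intermediate list only ever lives in the child, so when the child exits the OS takes all of that memory back, whatever the parent's allocator would have done with it. Only the small result crosses the pipe.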

However, this does not work well for turning PySpark DataFrames into Pandas data frames. Spark is lazy: a DataFrame is a description of a computation tied to the driver's live Spark session rather than the data itself, so it cannot be serialized and handed to the new process (you cannot serialize, for instance, an open network port).
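The network port analogy is easy to check with nothing but the standard library. The sketch below is only an illustration of that analogy and says nothing about PySpark's internals; can_pickle is a hypothetical helper name.

```python
import pickle
import socket


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False if pickling is refused."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


if __name__ == "__main__":
    # Plain data can be shipped to another process.
    print(can_pickle({"rows": [1, 2, 3]}))
    # A live OS resource such as a socket cannot.
    with socket.socket() as s:
        print(can_pickle(s))
```

Anything handed to a worker via multiprocessing has to pass this kind of check, which is why an object that is really a handle onto a live connection is a poor fit for the trick above.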

Ultimately, I had to cut my losses and revert from StatsModels to Spark's own logistic regression implementation.

In addition to this, we have the horrors of what happens with a Spark UDF when written in Python:

"Each row of the DataFrame is serialised, sent to the Python Runtime and returned to the JVM. As you can imagine, it is nothing optimal. There are some projects that try to optimise this problem. One of them is Apache Arrow, which is based on applying UDFs with Pandas." [Damavis]

Yuck. 
