Hail massively leaking memory

Hi!

We have a setup in which we run a (for purposes of this question, this is a bit simplified) interactive Python / Hail session, but the Spark context is such that calculations run on an AWS EMR cluster, of which the machine the interactive Python session is running on is not part.

We noticed that, at least for some tasks, Hail massively leaks memory. For example, we use this setup to load the Pan-UKB LD block matrix and extract part of a single row of it. Even without collecting the actual values, this adds up to roughly 200 MB of memory (the amount varies) that is not freed afterwards.
We are able to free some of that memory by manually calling the JVM garbage collector via hl.spark_context()._jvm.System.gc(), but that does not work reliably.
In code, what we do looks approximately like this (we are actually trying to obtain the Pearson correlation between a single variant and a region):

    ld_matrix, all_indices = ... # load the Pan-UKB LD block matrix and its variant index table
    variant_match = (all_indices.locus == <some hail Locus>) & (
        all_indices.alleles == [<some reference allele>, <some alternate allele>]
    )
    variant_index = all_indices.filter(variant_match).idx.collect()[0]
    # Subset the index table to the region of interest
    interval = hl.locus_interval(
        <some chromosome>,
        <lower region boundary>,
        <upper region boundary>,
        includes_end=True
    )
    region_indices = all_indices.filter(interval.contains(all_indices.locus))
    ld_values = (
        ld_matrix[variant_index, :]
        .filter_cols(region_indices.idx.collect())
    )

I’d be happy to provide a minimal working example if it is of any help, but first I would like to ask whether there is some sort of cache we can manually clear, or whether anyone has other suggestions as to what might be going on.

Block matrices are stored as 128 MB blocks by default, so reading a single row should allocate at least 128 MB of memory. Java’s memory model doesn’t free memory as soon as it goes out of scope (à la C++ shared/unique pointers with reference counting); instead, it does a mark-and-sweep GC pass when there is memory pressure.
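To make that number concrete, here is a back-of-the-envelope sketch. It assumes Hail's default block size of 4096 entries per side (`BlockMatrix.default_block_size`) and float64 entries; the exact default may differ between Hail versions, so treat this as arithmetic, not gospel:

```python
import math

# Assumed defaults: 4096x4096 blocks of float64 entries.
BLOCK_SIZE = 4096
BYTES_PER_ENTRY = 8

# One full block in memory:
block_bytes = BLOCK_SIZE * BLOCK_SIZE * BYTES_PER_ENTRY
print(block_bytes // 2**20, "MiB per block")  # 128 MiB

# Extracting a single row of an n_cols-wide matrix means decoding
# every block in that block-row, i.e. ceil(n_cols / BLOCK_SIZE) blocks,
# even though only one row of each block is kept.
n_cols = 1_000_000  # illustrative matrix width
blocks_touched = math.ceil(n_cols / BLOCK_SIZE)
print(blocks_touched, "blocks decoded for one row")  # 245
```

So even a "tiny" row slice can transiently pull hundreds of megabytes through the JVM heap; whether and when that memory is returned is then up to the garbage collector.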

Are you seeing the JVM throw OOM exceptions? That’s a cause for concern if so. But seeing Java fill up available memory without failure shouldn’t be a problem.

Thank you a lot for this insightful answer! I honestly have no idea how Java’s garbage collection (or Java, for that matter…) works, so that was really helpful!
I now might have an idea of what’s going on: one bit of information (which I hadn’t thought was relevant) was missing from my question, namely that all of this runs in a Docker container. I thus suspect that Java isn’t aware it’s running in a container and assumes it can use all of the host’s memory. I will try to work around this by following some of the advice in this Medium post. I should be able to set options through the JAVA_OPTS environment variable, if I’m not mistaken.
For the record, and to answer your question: no, I did not see the JVM throw OOM exceptions; instead, the Docker host would just start swapping at some point.

There is one thing in your answer I don’t understand, though: you say that block matrices are stored as 128 MB blocks, and that reading a single row should allocate at least 128 MB of memory. I am wondering why this happens on the driver. I would have expected this kind of memory usage on the “worker” nodes (please forgive my likely incorrect Spark-speak). Is there a way to prevent Spark from loading that kind of data into the driver’s memory? I would have expected that the only data materialized in the driver’s memory is whatever I get when I call .collect() / .to_numpy() or similar methods.

The post you’re looking for is: How do I increase the memory or RAM available to the JVM when I start Hail through Python?. Spark through Python is a bit more complicated than running a plain old Java app.
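For reference, the usual fix from that FAQ is to pass driver-memory options through the PYSPARK_SUBMIT_ARGS environment variable before starting Python. A sketch (the 8g values and the script name are illustrative; size the heap to fit inside your container’s memory limit):

```shell
# Give the Spark driver JVM that Hail starts a bounded, explicit heap.
# 8g is a placeholder -- pick a value below your container's memory limit.
export PYSPARK_SUBMIT_ARGS="--driver-memory 8g --executor-memory 8g pyspark-shell"
python my_hail_script.py  # hypothetical script name
```

Note that this must be set before `hl.init()` creates the JVM; changing it afterwards has no effect on an already-running driver.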

On your second point, you correctly understand how Hail / Spark works.

I can’t gainfully comment on the 200 MB of RAM usage without the Hail log files, the script, and a profiler hooked up to the JVM. When using our hailctl utility, a Hail driver node has 16 cores and ~60 GB of RAM. It’s entirely possible that Hail or Spark is doing something silly and holding ~200 MB of RAM around, because we usually have an abundance of memory available.

I empathize with the pain of diagnosing performance issues in a JVM controlled via Python. We’re actively moving away from Spark & the JVM for that reason, but it’s a multi-year project.

Thanks, @danking!
I experimented a bit further and can in fact confirm that the JVM garbage collection works as @tpoterba described, with the following caveat: the first couple of Hail queries pile up quite a lot of memory, perhaps up to 1 GB in total. But then memory consumption plateaus, even after tens of consecutive queries. The problem here is that the machine I ran this on had a relatively small amount of RAM (4 GB), and the initial memory build-up that is not garbage collected exceeded the total memory available. After doubling the available RAM, everything works neatly :slightly_smiling_face:

I am still unclear as to why the initial memory build-up of ~1 GB is not freed, even as the machine’s memory limit is approached. This might or might not be a bug, and as @danking says, it probably requires more debugging / memory-profiling work.

In any case, for our use case we can easily use a machine with more memory, so we are good :+1: Thanks a lot to both of you!

As for moving away from Spark & the JVM: that’s good to hear, and I can imagine how big an undertaking that is. I guess that’s the Hail Batch line of work. But already a big compliment on the ease of use of Hail, the exhaustive documentation, and the great support here!
