Using HDFS on EMR

Hi Hail team! I’m working on a pipeline that runs on AWS EMR, and at the moment it’s having some performance issues. I have a couple of questions about improving performance:

  1. At the moment we are reading and writing our data to an S3 bucket, but we were considering writing directly to HDFS. However, I’ve seen @danking warn against writing matrix tables to HDFS in several other forum posts. Does that advice still apply on AWS EMR?
  2. Along the same lines, I was considering upgrading our volume mount to something with faster I/O. Could this also help performance?

Thanks,
Rob

Hi @gilmorera

We don’t have any experience running Hail on AWS, but in GCP we write all Table and MatrixTable results (final or intermediate) to object storage. The main reason is that Hail is designed to make use of preemptible compute nodes, which can disappear at any time. Persisting all results of distributed computation in durable storage is a simple design that gives us fault tolerance. Since Hail pipelines typically stream through large amounts of data, and cloud object stores have high bandwidth, we don’t usually find this to be a performance bottleneck.
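
As a rough illustration of that pattern (the bucket name and paths here are just placeholders):

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/input/dataset.mt')

# ... distributed transformations ...
mt = hl.variant_qc(mt)

# Persist the intermediate result durably in object storage; if preemptible
# workers disappear mid-pipeline, the checkpoint can simply be re-read
# instead of recomputed from scratch.
mt = mt.checkpoint('gs://my-bucket/checkpoints/dataset-qc.mt', overwrite=True)

# The final result also goes to object storage.
mt.write('gs://my-bucket/output/dataset-final.mt', overwrite=True)
```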

If you could share more details about what performance issues you’re seeing, we might be able to help.

Hi @patrick-schultz. Thanks for the quick response!

I will definitely use this thread to post more about our issue, but for now I would like to dig into this storage topic, if that’s alright with you. In AWS EMR we are using EC2 instance nodes, which can be either SPOT or ON_DEMAND. I believe GCP’s preemptible compute nodes are analogous to EC2 SPOT instances. However, we are using ON_DEMAND instances, which cannot be reclaimed by AWS, so we are essentially guaranteed the same hardware (unless something catastrophic happens) until the EMR cluster is terminated. In this scenario, would it be safe to use HDFS temporarily during our pipeline? Essentially, my goal would be to use HDFS during the processing stage and, at the end of the job, copy the primary results to S3 or another permanent storage location.
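
Concretely, the flow I have in mind would be something along these lines (file names and paths are just placeholders):

```python
import hail as hl

mt = hl.import_vcf('s3://our-input-bucket/cohort.vcf.bgz', reference_genome='GRCh38')

# Intermediate checkpoints on the cluster's local HDFS while the ON_DEMAND
# nodes are alive.
mt = mt.checkpoint('hdfs:///tmp/cohort-stage1.mt', overwrite=True)

# ... annotation / QC stages ...

# At the end of the job, write the primary results to durable storage.
mt.write('s3://our-results-bucket/cohort-final.mt', overwrite=True)
```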
Thanks again,
Rob

Additionally, I’m curious what your thoughts are on the performance side of this. My hypothesis is that we would get better read/write performance using HDFS, since the Hadoop/Spark backend has better “awareness” of what’s happening in HDFS than in S3.

For more context on why I’m going down this route: we have a lot of Hail Tables in a separate S3 bucket that we use to annotate our matrix tables in the EMR cluster pipeline. If you’re familiar with the seqr-loading-pipelines, it’s essentially an older version of that. I’m working on measuring the size of that data locally as we speak (VEP, gnomAD, ClinVar, dbNSFP, CADD, etc., plus 5-10 proprietary cohort AF datasets). Additionally, we’re running around 20 EMR clusters at the same time, all using the same pipeline and the same annotation data to process our VCF files. We are seeing bottlenecks during one of the Spark stages, which appears to have memory issues:

hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 369 in stage 7.0 failed 10 times, most recent failure: Lost task 369.9 in stage 7.0 (TID 6656, ip-172-21-68-220.ec2.internal, executor 38): ExecutorLostFailure (executor 38 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:

You can see in the error logs that the task failed 10 times before this error was reported to our main logs. When we dig deeper into the logs, we see that the other 9 failed attempts died for the same reason. To try to resolve this, we changed the settings suggested in the error message:

Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
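
For anyone following along, the Spark side of that change can be applied when initializing Hail, roughly like this (the value below is only illustrative; yarn.nodemanager.vmem-check-enabled is a YARN NodeManager setting and lives in yarn-site.xml on the cluster rather than in Spark conf):

```python
import hail as hl

# Give each executor extra off-heap headroom so YARN doesn't kill the
# container for exceeding memory limits. 4g is only an illustrative value;
# tune it to your instance types.
hl.init(spark_conf={'spark.yarn.executor.memoryOverhead': '4g'})
```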

While we have seen some success with these changes, we are still getting very long-running jobs (2.5 days…) before completion. Based on previous experience we normally expect runs of around 16 hours, and even that seems like an unreasonable level of performance.

I would be grateful for any help you can give me with all of this in mind. However, I’d appreciate it if you could help me understand the HDFS hypothesis first :slight_smile:.
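
For reference, the annotation step I mentioned above looks roughly like this (table names and paths are placeholders):

```python
import hail as hl

mt = hl.read_matrix_table('hdfs:///tmp/cohort-stage1.mt')  # or an s3:// path

# The annotation tables live in a shared S3 bucket and are joined on the
# matrix table's row key (locus, alleles).
gnomad_ht = hl.read_table('s3://our-annotations/gnomad.ht')
clinvar_ht = hl.read_table('s3://our-annotations/clinvar.ht')

mt = mt.annotate_rows(
    gnomad=gnomad_ht[mt.row_key],
    clinvar=clinvar_ht[mt.row_key],
)
```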