Index_bgen() on UKBB imputed data expected time

Hello Hail team,

I'm trying to index UKBB's imputed BGEN genotype data with index_bgen(), but it's taking quite a long time. For example, indexing the chr22 file (36.6 GB) ran for more than 6 hours and didn't finish.

I also ran a test with a small file (1,000 variants) and it completed in a reasonable time.

Is there an expected time to complete index_bgen() on these larger files? Or is there perhaps a way to speed up this process?

Thank you for the support!
Rodrigo

Do you have a Hail log file for the slow pipeline?

I'm afraid not. The slow pipeline broke just after the six-hour mark for what appears to be an unrelated issue.

We are running it on an Amazon EMR Jupyter notebook with PySpark, so I guess that specific log file was lost when the cluster was terminated.

I can try to generate it again if necessary though.

Hello again,

I’ve managed to obtain a hail log from the same pipeline execution. I hope this helps.

Thank you all,

2022-01-12 14:49:45 Hail: INFO: Running Hail version 0.2.78-b17627756568
2022-01-12 14:49:45 SharedState: INFO: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('hdfs:///user/spark/warehouse').
2022-01-12 14:49:45 SharedState: INFO: Warehouse path is 'hdfs:///user/spark/warehouse'.
2022-01-12 14:49:45 ServerInfo: INFO: Adding filter to /SQL: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
2022-01-12 14:49:45 ContextHandler: INFO: Started o.s.j.s.ServletContextHandler@8b80fca{/SQL,null,AVAILABLE,@Spark}
2022-01-12 14:49:45 ServerInfo: INFO: Adding filter to /SQL/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
2022-01-12 14:49:45 ContextHandler: INFO: Started o.s.j.s.ServletContextHandler@756976cc{/SQL/json,null,AVAILABLE,@Spark}
2022-01-12 14:49:45 ServerInfo: INFO: Adding filter to /SQL/execution: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
2022-01-12 14:49:45 ContextHandler: INFO: Started o.s.j.s.ServletContextHandler@4ab04c04{/SQL/execution,null,AVAILABLE,@Spark}
2022-01-12 14:49:45 ServerInfo: INFO: Adding filter to /SQL/execution/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
2022-01-12 14:49:45 ContextHandler: INFO: Started o.s.j.s.ServletContextHandler@562fdca4{/SQL/execution/json,null,AVAILABLE,@Spark}
2022-01-12 14:49:45 ServerInfo: INFO: Adding filter to /static/sql: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
2022-01-12 14:49:45 ContextHandler: INFO: Started o.s.j.s.ServletContextHandler@109f0478{/static/sql,null,AVAILABLE,@Spark}
2022-01-12 14:49:46 SparkSession$Builder: WARN: Using an existing SparkSession; the static sql configurations will not take effect.
2022-01-12 14:49:46 SparkSession$Builder: WARN: Using an existing SparkSession; some spark core configurations may not take effect.
2022-01-12 14:49:47 root: INFO: RegionPool: initialized for thread 112: Thread-60
2022-01-12 14:49:48 MemoryStore: INFO: Block broadcast_0 stored as values in memory (estimated size 336.9 KiB, free 912.0 MiB)
2022-01-12 14:49:48 MemoryStore: INFO: Block broadcast_0_piece0 stored as bytes in memory (estimated size 32.4 KiB, free 911.9 MiB)
2022-01-12 14:49:48 BlockManagerInfo: INFO: Added broadcast_0_piece0 in memory on ip-xx-xx-xx-xxx.ec2.internal:39933 (size: 32.4 KiB, free: 912.3 MiB)
2022-01-12 14:49:48 SparkContext: INFO: Created broadcast 0 from broadcast at SparkBackend.scala:278
2022-01-12 14:49:49 ClientConfigurationFactory: INFO: Set initial getObject socket timeout to 2000 ms.
2022-01-12 14:49:50 root: INFO: globbing path s3://bucket_name/folder_for_microarray_imputed/raw/ukb22828_chr22_v3.bgen returned 1 files: ukb22828_chr22_v3.bgen
2022-01-12 14:49:50 S3NativeFileSystem: INFO: Opening 's3://bucket_name/folder_for_microarray_imputed/raw/ukb22828_chr22_v3.bgen' for reading
2022-01-12 14:49:50 MemoryStore: INFO: Block broadcast_1 stored as values in memory (estimated size 34.7 KiB, free 911.9 MiB)
2022-01-12 14:49:50 MemoryStore: INFO: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.2 KiB, free 911.9 MiB)
2022-01-12 14:49:50 BlockManagerInfo: INFO: Added broadcast_1_piece0 in memory on ip-xx-xx-xx-xxx.ec2.internal:39933 (size: 3.2 KiB, free: 912.3 MiB)
2022-01-12 14:49:50 SparkContext: INFO: Created broadcast 1 from broadcast at SparkBackend.scala:278
2022-01-12 14:49:51 root: INFO: instruction count: 3: __C1FSContainer.<init>
2022-01-12 14:49:51 root: INFO: instruction count: 3: __C1FSContainer.<clinit>
2022-01-12 14:49:51 root: INFO: instruction count: 3: __C2bgen_rdd_decoder.<init>
2022-01-12 14:49:51 root: INFO: instruction count: 437: __C2bgen_rdd_decoder.apply
2022-01-12 14:49:51 root: INFO: instruction count: 16: __C2bgen_rdd_decoder.apply
2022-01-12 14:49:51 root: INFO: instruction count: 9: __C2bgen_rdd_decoder.setPartitionIndex
2022-01-12 14:49:51 root: INFO: instruction count: 4: __C2bgen_rdd_decoder.addPartitionRegion
2022-01-12 14:49:51 root: INFO: instruction count: 4: __C2bgen_rdd_decoder.setPool
2022-01-12 14:49:51 root: INFO: instruction count: 3: __C2bgen_rdd_decoder.addFS
2022-01-12 14:49:51 root: INFO: decoder cache miss (0 hits, 1 misses, 0.000
2022-01-12 14:49:51 root: INFO: instruction count: 3: __C13FSContainer.<init>
2022-01-12 14:49:51 root: INFO: instruction count: 3: __C13FSContainer.<clinit>
2022-01-12 14:49:51 root: INFO: instruction count: 3: __C14etypeDecode.<init>
2022-01-12 14:49:51 root: INFO: instruction count: 9: __C14etypeDecode.apply
2022-01-12 14:49:51 root: INFO: instruction count: 34: __C14etypeDecode.__m15DECODE_o_struct_of_r_int64ANDr_array_of_r_struct_of_o_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryENDANDr_int64ANDr_struct_of_ENDENDEND_TO_SBaseStructPointer
2022-01-12 14:49:51 root: INFO: instruction count: 10: __C14etypeDecode.__m16INPLACE_DECODE_r_int64_TO_r_int64
2022-01-12 14:49:51 root: INFO: instruction count: 64: __C14etypeDecode.__m17INPLACE_DECODE_r_array_of_r_struct_of_o_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryENDANDr_int64ANDr_struct_of_ENDEND_TO_r_array_of_r_struct_of_o_struct_of_o_locusANDo_array_of_o_stringENDANDr_int64ANDr_struct_of_ENDEND
2022-01-12 14:49:51 root: INFO: instruction count: 60: __C14etypeDecode.__m18INPLACE_DECODE_r_struct_of_o_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryENDANDr_int64ANDr_struct_of_ENDEND_TO_r_struct_of_o_struct_of_o_locusANDo_array_of_o_stringENDANDr_int64ANDr_struct_of_ENDEND
2022-01-12 14:49:51 root: INFO: instruction count: 71: __C14etypeDecode.__m19INPLACE_DECODE_o_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryEND_TO_o_struct_of_o_locusANDo_array_of_o_stringEND
2022-01-12 14:49:51 root: INFO: instruction count: 22: __C14etypeDecode.__m20INPLACE_DECODE_o_struct_of_r_binaryANDr_int32END_TO_o_locus
2022-01-12 14:49:51 root: INFO: instruction count: 33: __C14etypeDecode.__m21INPLACE_DECODE_r_binary_TO_r_string
2022-01-12 14:49:51 root: INFO: instruction count: 10: __C14etypeDecode.__m22INPLACE_DECODE_r_int32_TO_r_int32
2022-01-12 14:49:51 root: INFO: instruction count: 120: __C14etypeDecode.__m23INPLACE_DECODE_o_array_of_o_binary_TO_o_array_of_o_string
2022-01-12 14:49:51 root: INFO: instruction count: 33: __C14etypeDecode.__m24INPLACE_DECODE_o_binary_TO_o_string
2022-01-12 14:49:51 root: INFO: instruction count: 8: __C14etypeDecode.__m25INPLACE_DECODE_r_struct_of_END_TO_r_struct_of_END
2022-01-12 14:49:51 root: INFO: decoder cache miss (0 hits, 2 misses, 0.000
2022-01-12 14:49:51 root: INFO: instruction count: 3: __C26FSContainer.<init>
2022-01-12 14:49:51 root: INFO: instruction count: 3: __C26FSContainer.<clinit>
2022-01-12 14:49:51 root: INFO: instruction count: 3: __C27etypeDecode.<init>
2022-01-12 14:49:51 root: INFO: instruction count: 9: __C27etypeDecode.apply
2022-01-12 14:49:51 root: INFO: instruction count: 27: __C27etypeDecode.__m28DECODE_o_struct_of_r_array_of_r_struct_of_r_int64ANDr_int64ANDo_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryENDANDr_int64ANDr_struct_of_ENDENDEND_TO_SBaseStructPointer
2022-01-12 14:49:51 root: INFO: instruction count: 64: __C27etypeDecode.__m29INPLACE_DECODE_r_array_of_r_struct_of_r_int64ANDr_int64ANDo_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryENDANDr_int64ANDr_struct_of_ENDEND_TO_r_array_of_r_struct_of_r_int64ANDr_int64ANDo_struct_of_o_locusANDo_array_of_o_stringENDANDr_int64ANDr_struct_of_ENDEND
2022-01-12 14:49:51 root: INFO: instruction count: 74: __C27etypeDecode.__m30INPLACE_DECODE_r_struct_of_r_int64ANDr_int64ANDo_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryENDANDr_int64ANDr_struct_of_ENDEND_TO_r_struct_of_r_int64ANDr_int64ANDo_struct_of_o_locusANDo_array_of_o_stringENDANDr_int64ANDr_struct_of_ENDEND
2022-01-12 14:49:51 root: INFO: instruction count: 10: __C27etypeDecode.__m31INPLACE_DECODE_r_int64_TO_r_int64
2022-01-12 14:49:51 root: INFO: instruction count: 71: __C27etypeDecode.__m32INPLACE_DECODE_o_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryEND_TO_o_struct_of_o_locusANDo_array_of_o_stringEND
2022-01-12 14:49:51 root: INFO: instruction count: 22: __C27etypeDecode.__m33INPLACE_DECODE_o_struct_of_r_binaryANDr_int32END_TO_o_locus
2022-01-12 14:49:51 root: INFO: instruction count: 33: __C27etypeDecode.__m34INPLACE_DECODE_r_binary_TO_r_string
2022-01-12 14:49:51 root: INFO: instruction count: 10: __C27etypeDecode.__m35INPLACE_DECODE_r_int32_TO_r_int32
2022-01-12 14:49:51 root: INFO: instruction count: 120: __C27etypeDecode.__m36INPLACE_DECODE_o_array_of_o_binary_TO_o_array_of_o_string
2022-01-12 14:49:51 root: INFO: instruction count: 33: __C27etypeDecode.__m37INPLACE_DECODE_o_binary_TO_o_string
2022-01-12 14:49:51 root: INFO: instruction count: 8: __C27etypeDecode.__m38INPLACE_DECODE_r_struct_of_END_TO_r_struct_of_END
2022-01-12 14:49:52 root: INFO: instruction count: 3: __C39FSContainer.<init>
2022-01-12 14:49:52 root: INFO: instruction count: 3: __C39FSContainer.<clinit>
2022-01-12 14:49:52 root: INFO: instruction count: 3: __C40indexwriter.<init>
2022-01-12 14:49:52 root: INFO: instruction count: 372: __C40indexwriter.apply
2022-01-12 14:49:52 root: INFO: instruction count: 674: __C40indexwriter.__m49writeInternalNode
2022-01-12 14:49:52 root: INFO: instruction count: 37: __C40indexwriter.__m50ENCODE_SBaseStructPointer_TO_o_struct_of_r_array_of_r_struct_of_r_int64ANDr_int64ANDo_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryENDANDr_int64ANDr_struct_of_ENDENDEND
2022-01-12 14:49:52 root: INFO: instruction count: 37: __C40indexwriter.__m51ENCODE_SIndexablePointer_TO_r_array_of_r_struct_of_r_int64ANDr_int64ANDo_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryENDANDr_int64ANDr_struct_of_ENDEND
2022-01-12 14:49:52 root: INFO: instruction count: 86: __C40indexwriter.__m52ENCODE_SBaseStructPointer_TO_r_struct_of_r_int64ANDr_int64ANDo_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryENDANDr_int64ANDr_struct_of_ENDEND
2022-01-12 14:49:52 root: INFO: instruction count: 4: __C40indexwriter.__m53ENCODE_SInt64$_TO_r_int64
2022-01-12 14:49:52 root: INFO: instruction count: 137: __C40indexwriter.__m54ENCODE_SBaseStructPointer_TO_o_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryEND
2022-01-12 14:49:52 root: INFO: instruction count: 29: __C40indexwriter.__m55ENCODE_SCanonicalLocusPointer_TO_o_struct_of_r_binaryANDr_int32END
2022-01-12 14:49:52 root: INFO: instruction count: 16: __C40indexwriter.__m56ENCODE_SStringPointer_TO_r_binary
2022-01-12 14:49:52 root: INFO: instruction count: 4: __C40indexwriter.__m57ENCODE_SInt32$_TO_r_int32
2022-01-12 14:49:52 root: INFO: instruction count: 117: __C40indexwriter.__m58ENCODE_SIndexablePointer_TO_o_array_of_o_binary
2022-01-12 14:49:52 root: INFO: instruction count: 16: __C40indexwriter.__m59ENCODE_SStringPointer_TO_o_binary
2022-01-12 14:49:52 root: INFO: instruction count: 1: __C40indexwriter.__m60ENCODE_SBaseStructPointer_TO_r_struct_of_END
2022-01-12 14:49:52 root: INFO: instruction count: 574: __C40indexwriter.__m62writeLeafNode
2022-01-12 14:49:52 root: INFO: instruction count: 51: __C40indexwriter.__m63ENCODE_SBaseStructPointer_TO_o_struct_of_r_int64ANDr_array_of_r_struct_of_o_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryENDANDr_int64ANDr_struct_of_ENDENDEND
2022-01-12 14:49:52 root: INFO: instruction count: 37: __C40indexwriter.__m64ENCODE_SIndexablePointer_TO_r_array_of_r_struct_of_o_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryENDANDr_int64ANDr_struct_of_ENDEND
2022-01-12 14:49:52 root: INFO: instruction count: 58: __C40indexwriter.__m65ENCODE_SBaseStructPointer_TO_r_struct_of_o_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryENDANDr_int64ANDr_struct_of_ENDEND
2022-01-12 14:49:52 root: INFO: instruction count: 52: __C40indexwriter.__m66flush
2022-01-12 14:49:52 root: INFO: instruction count: 182: __C40indexwriter.init
2022-01-12 14:49:52 root: INFO: instruction count: 4: __C40indexwriter.setObjects
2022-01-12 14:49:52 root: INFO: instruction count: 28: __C40indexwriter.close
2022-01-12 14:49:52 root: INFO: instruction count: 7: __C40indexwriter.trackedOS
2022-01-12 14:49:52 root: INFO: instruction count: 9: __C40indexwriter.setPartitionIndex
2022-01-12 14:49:52 root: INFO: instruction count: 4: __C40indexwriter.addPartitionRegion
2022-01-12 14:49:52 root: INFO: instruction count: 4: __C40indexwriter.setPool
2022-01-12 14:49:52 root: INFO: instruction count: 3: __C40indexwriter.addFS
2022-01-12 14:49:52 root: INFO: instruction count: 3: __C70__m49writeInternalNodeSpills.<init>
2022-01-12 14:49:52 root: INFO: instruction count: 3: __C73__m62writeLeafNodeSpills.<init>
2022-01-12 14:49:52 MemoryStore: INFO: Block broadcast_2 stored as values in memory (estimated size 60.6 KiB, free 911.8 MiB)
2022-01-12 14:49:52 MemoryStore: INFO: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1495.0 B, free 911.8 MiB)
2022-01-12 14:49:52 BlockManagerInfo: INFO: Added broadcast_2_piece0 in memory on ip-xx-xx-xx-xxx.ec2.internal:39933 (size: 1495.0 B, free: 912.3 MiB)
2022-01-12 14:49:52 SparkContext: INFO: Created broadcast 2 from broadcast at RVDPartitioner.scala:93
2022-01-12 14:49:52 root: INFO: encoder cache miss (0 hits, 1 misses, 0.000)
2022-01-12 14:49:52 root: INFO: instruction count: 3: __C74FSContainer.<init>
2022-01-12 14:49:52 root: INFO: instruction count: 3: __C74FSContainer.<clinit>
2022-01-12 14:49:52 root: INFO: instruction count: 3: __C75etypeEncode.<init>
2022-01-12 14:49:52 root: INFO: instruction count: 9: __C75etypeEncode.apply
2022-01-12 14:49:52 root: INFO: instruction count: 187: __C75etypeEncode.__m76ENCODE_SBaseStructPointer_TO_r_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryANDo_int64ANDo_int32END
2022-01-12 14:49:52 root: INFO: instruction count: 29: __C75etypeEncode.__m77ENCODE_SCanonicalLocusPointer_TO_o_struct_of_r_binaryANDr_int32END
2022-01-12 14:49:52 root: INFO: instruction count: 16: __C75etypeEncode.__m78ENCODE_SStringPointer_TO_r_binary
2022-01-12 14:49:52 root: INFO: instruction count: 4: __C75etypeEncode.__m79ENCODE_SInt32$_TO_r_int32
2022-01-12 14:49:52 root: INFO: instruction count: 117: __C75etypeEncode.__m80ENCODE_SIndexablePointer_TO_o_array_of_o_binary
2022-01-12 14:49:52 root: INFO: instruction count: 16: __C75etypeEncode.__m81ENCODE_SStringPointer_TO_o_binary
2022-01-12 14:49:52 root: INFO: instruction count: 4: __C75etypeEncode.__m82ENCODE_SInt64$_TO_o_int64
2022-01-12 14:49:52 root: INFO: instruction count: 4: __C75etypeEncode.__m83ENCODE_SInt32$_TO_o_int32
2022-01-12 14:49:52 root: INFO: decoder cache miss (0 hits, 3 misses, 0.000
2022-01-12 14:49:52 root: INFO: instruction count: 3: __C84FSContainer.<init>
2022-01-12 14:49:52 root: INFO: instruction count: 3: __C84FSContainer.<clinit>
2022-01-12 14:49:52 root: INFO: instruction count: 3: __C85etypeDecode.<init>
2022-01-12 14:49:52 root: INFO: instruction count: 9: __C85etypeDecode.apply
2022-01-12 14:49:52 root: INFO: instruction count: 133: __C85etypeDecode.__m86DECODE_r_struct_of_o_struct_of_r_binaryANDr_int32ENDANDo_array_of_o_binaryANDo_int64ANDo_int32END_TO_SBaseStructPointer
2022-01-12 14:49:52 root: INFO: instruction count: 22: __C85etypeDecode.__m87INPLACE_DECODE_o_struct_of_r_binaryANDr_int32END_TO_o_locus
2022-01-12 14:49:52 root: INFO: instruction count: 33: __C85etypeDecode.__m88INPLACE_DECODE_r_binary_TO_r_string
2022-01-12 14:49:52 root: INFO: instruction count: 10: __C85etypeDecode.__m89INPLACE_DECODE_r_int32_TO_r_int32
2022-01-12 14:49:52 root: INFO: instruction count: 120: __C85etypeDecode.__m90INPLACE_DECODE_o_array_of_o_binary_TO_o_array_of_o_string
2022-01-12 14:49:52 root: INFO: instruction count: 33: __C85etypeDecode.__m91INPLACE_DECODE_o_binary_TO_o_string
2022-01-12 14:49:52 root: INFO: instruction count: 10: __C85etypeDecode.__m92INPLACE_DECODE_o_int64_TO_o_int64
2022-01-12 14:49:52 root: INFO: instruction count: 10: __C85etypeDecode.__m93INPLACE_DECODE_o_int32_TO_o_int32
2022-01-12 14:49:52 SparkContext: INFO: Starting job: foreachPartition at IndexBgen.scala:116
2022-01-12 14:49:52 DAGScheduler: INFO: Registering RDD 5 (mapPartitions at ContextRDD.scala:168) as input to shuffle 0
2022-01-12 14:49:52 DAGScheduler: INFO: Got job 0 (foreachPartition at IndexBgen.scala:116) with 1 output partitions
2022-01-12 14:49:52 DAGScheduler: INFO: Final stage: ResultStage 1 (foreachPartition at IndexBgen.scala:116)
2022-01-12 14:49:52 DAGScheduler: INFO: Parents of final stage: List(ShuffleMapStage 0)
2022-01-12 14:49:52 DAGScheduler: INFO: Missing parents: List(ShuffleMapStage 0)
2022-01-12 14:49:52 DAGScheduler: INFO: Submitting ShuffleMapStage 0 (MapPartitionsRDD[5] at mapPartitions at ContextRDD.scala:168), which has no missing parents
2022-01-12 14:49:52 MemoryStore: INFO: Block broadcast_3 stored as values in memory (estimated size 50.0 KiB, free 911.8 MiB)
2022-01-12 14:49:52 MemoryStore: INFO: Block broadcast_3_piece0 stored as bytes in memory (estimated size 25.0 KiB, free 911.8 MiB)
2022-01-12 14:49:52 BlockManagerInfo: INFO: Added broadcast_3_piece0 in memory on ip-xx-xx-xx-xxx.ec2.internal:39933 (size: 25.0 KiB, free: 912.2 MiB)
2022-01-12 14:49:52 SparkContext: INFO: Created broadcast 3 from broadcast at DAGScheduler.scala:1479
2022-01-12 14:49:52 DAGScheduler: INFO: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[5] at mapPartitions at ContextRDD.scala:168) (first 15 tasks are for partitions Vector(0))
2022-01-12 14:49:52 YarnScheduler: INFO: Adding task set 0.0 with 1 tasks resource profile 0
2022-01-12 14:49:52 TaskSetManager: INFO: Starting task 0.0 in stage 0.0 (TID 0) (ip-xx-xx-xx-xx.ec2.internal, executor 1, partition 0, PROCESS_LOCAL, 4748 bytes) taskResourceAssignments Map()
2022-01-12 14:49:52 BlockManagerInfo: INFO: Added broadcast_3_piece0 in memory on ip-xx-xx-xx-xx.ec2.internal:42841 (size: 25.0 KiB, free: 25.8 GiB)
2022-01-12 14:49:54 BlockManagerInfo: INFO: Added broadcast_2_piece0 in memory on ip-xx-xx-xx-xx.ec2.internal:42841 (size: 1495.0 B, free: 25.8 GiB)
2022-01-12 14:49:54 BlockManagerInfo: INFO: Added broadcast_0_piece0 in memory on ip-xx-xx-xx-xx.ec2.internal:42841 (size: 32.4 KiB, free: 25.8 GiB)
2022-01-12 14:49:55 BlockManagerInfo: INFO: Added broadcast_1_piece0 in memory on ip-xx-xx-xx-xx.ec2.internal:42841 (size: 3.2 KiB, free: 25.8 GiB)
2022-01-12 14:51:28 BlockManagerInfo: INFO: Removed broadcast_3_piece0 on ip-xx-xx-xx-xxx.ec2.internal:44653 in memory (size: 49.0 B, free: 911.9 MiB)
2022-01-12 14:51:28 BlockManagerInfo: INFO: Removed broadcast_5_piece0 on ip-xx-xx-xx-xxx.ec2.internal:44653 in memory (size: 201.2 KiB, free: 912.1 MiB)
2022-01-12 15:39:43 HealthTracker: INFO: Removing node (ip-xx-xx-xx-xx.ec2.internal,org.apache.spark.scheduler.ExcludeOnFailureTimedOut$@266fabbc,Some(1642001983741)) from ExclusionList because Exclude on failure timeout has reached.
2022-01-12 16:18:23 Hail: INFO: copying log to 's3://bucket_name/folder_for_sandbox/hail_log.log'...

Hey @rodrigo.barreiro ,

We’re a little stumped by this. Can you share the full Python script you ran?

Thanks!

Thank you for the reply, the code is quite simple:

import hail as hl
from hail.linalg import BlockMatrix

DEFAULT_REF = "GRCh37"

hl.init(
    sc,
    default_reference=DEFAULT_REF,
    log='/tmp/hail_log.log'
)

ROOT = <S3 PATH>

filename = "ukb22828_chr22_v3"

hl.index_bgen(path=f'{ROOT}/{filename}.bgen',
              reference_genome=DEFAULT_REF)

@rodrigo.barreiro ,

Hmm.

OK, I just ran this:

import hail as hl
hl.index_bgen(
    'gs://.../ukb_imp_chr22_v2.bgen',
    index_file_map={
        'gs://.../ukb_imp_chr22_v2.bgen': 'gs://danking/ukb_imp_chr22_v2.bgen.idx2'})

And it successfully generated an index file in my bucket. This process took about 10 minutes from my laptop, so it should be faster on a machine in the cloud. Our imputed chr22 BGEN file is ~34 GiB. Can you confirm that yours is also approximately the same size?

Since you’re using hl.init with the sc parameter, I strongly suspect the issue is that Spark is not properly configured. How did you install and configure Spark and Hail? There are several important Spark configuration parameters which must be set for Hail to work properly. See the last section here.


Hi Dan and Tim,

Thanks for your test and feedback. I hope it's not too late to ask you to take a careful look at our infrastructure JSON and make some comments, or even point us to a repo with a better JSON config to improve our AWS EMR setup.

This JSON works reasonably well for UKBB non-imputed array data, but unfortunately it is slow and sometimes presents many inexplicable "shutdowns". It does work with the PySpark interface, though not as well as we would like.

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.maxConnections": "1000"
    }
  },
  {
    "Classification": "livy-conf",
    "Properties": {
      "livy.server.session.timeout": "5h"
    }
  },
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.jars": "/usr/lib/spark/jars/hail-all-spark.jar",
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/docker/usr/lib/hadoop-lzo/lib/*:/docker/usr/lib/hadoop/hadoop-aws.jar:/docker/usr/share/aws/aws-java-sdk/*:/docker/usr/share/aws/emr/security/conf:/docker/usr/share/aws/emr/security/lib/*:/docker/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/docker/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/docker/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/docker/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/usr/local/lib/python3.7/site-packages/hail/backend/hail-all-spark.jar",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/docker/usr/lib/hadoop-lzo/lib/*:/docker/usr/lib/hadoop/hadoop-aws.jar:/docker/usr/share/aws/aws-java-sdk/*:/docker/usr/share/aws/emr/security/conf:/docker/usr/share/aws/emr/security/lib/*:/docker/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/docker/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/docker/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/docker/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/usr/local/lib/python3.7/site-packages/hail/backend/hail-all-spark.jar",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.kryo.registrator": "is.hail.kryo.HailKryoRegistrator",
      "spark.kryoserializer.buffer.max": "1g"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  },
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]

Thanks a lot for any insight into the JSON, and for pointing out any typos to fix or missing libraries that might explain why things aren't running smoothly. You've been very helpful; I hope we can improve this config together.

Thanks,
Allysson

Hey @allysson !

Your EMR JSON configuration looks fine to me. Are you able to successfully run index_bgen using a cluster created with this configuration? If yes, how long does index_bgen take to complete?

Do I also understand correctly that sometimes index_bgen fails (or runs for >6 hours) and sometimes it succeeds? How often does it succeed versus fail? 50:50, 10:90, etc.?

Are you using: Amazon EMR on EC2, Amazon EMR Serverless, or Amazon EMR on EKS?

How many VMs (or instances, or cores) are in your cluster? If you start pyspark and run hl.utils.range_table(10000, n_partitions=10000)._force_count(), you should see a progress bar like this:

[Stage 28:>                                                 (1244 + 8) / 10000]

What number do you see after the plus sign? In my example, it is an 8. That number should be equal to the number of cores in your cluster. If it is not the same, then the cluster is not configured properly.
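If reading the bar by eye is awkward, the number after the plus sign can also be pulled out programmatically. A quick sketch; the helper below is just illustrative parsing of the bar text, not a Hail or Spark API:

```python
import re

def concurrent_tasks(progress_line: str) -> int:
    """Return the concurrent-task count (the number after the '+') from a
    Spark progress-bar line; it should roughly equal the number of cores
    available in the cluster."""
    match = re.search(r"\((\d+)\s*\+\s*(\d+)\)\s*/\s*\d+", progress_line)
    if match is None:
        raise ValueError("not a Spark progress-bar line")
    return int(match.group(2))

line = "[Stage 28:>                         (1244 + 8) / 10000]"
print(concurrent_tasks(line))  # → 8
```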

Also, can you share the output of these two commands?

hl.version()
hl.spark_context().master

Hello @danking, thank you for helping us

Unfortunately, index_bgen() always failed, but that may be due to a Livy timeout configuration. The major problem was the time it took to execute.

Are you using: Amazon EMR on EC2, Amazon EMR Serverless, or Amazon EMR on EKS?

We are using EMR on EC2 machines (we tried a series of different EC2 compositions, so it was not fixed).

How many VMs (or instances, or cores) are in your cluster? If you start pyspark and run hl.utils.range_table(10000, n_partitions=10000)._force_count() , you should see a progress bar like this:

We have one master node (m4.4xlarge, 16 vCores, 64 GiB memory, 128 GB storage), 2–10 core nodes (m4.4xlarge, same specs), and 0–10 task nodes (we tried a series of m- and r-family EC2 machines). The number of machines is regulated by AWS Auto Scaling (it provisions more machines when there is more processing, which seems to correspond to the number of tasks submitted to Spark).

Unfortunately, the way we are currently configured, we can only use PySpark in a Jupyter notebook, so it only prints after the job has executed (we get a progress bar like the one you showed). Once it finished, it printed only 10000, and it took 9 s to complete.

In our tests with real data, the task progress advanced in increments of 100+ (though there were processes, like king, that had 19k+ partitions).

hl.version()

'0.2.78-b17627756568'

hl.spark_context().master

'yarn'

Thank you again for the support.

Hmm, OK. If

hl.utils.range_table(10000, n_partitions=10000)._force_count()

works successfully and seems to progress in increments of 100+, then it sounds like you probably have 6 “core nodes” [1] working correctly.


but there was processes like king that had +19k partitions

Yeah, hl.king is a very expensive operation: its runtime complexity is O(N^2 * M), where N is the number of samples and M is the number of variants. For operations like this, we strongly recommend using many spot-instance task nodes. I wrote a document about scaling with Hail on Google Cloud which might be helpful to you. A "preemptible" instance in Google is basically the same as a spot instance in Amazon. I would expect the auto-scaler to automatically add ~5000 cores so that you can quickly compute this operation.
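To make the O(N^2 * M) scaling concrete, here is a back-of-envelope sketch (the sample and variant counts below are illustrative, not anyone's actual dataset):

```python
def relative_work(n_samples: int, n_variants: int) -> int:
    # Relative cost of a pairwise computation that scales as O(N^2 * M).
    return n_samples ** 2 * n_variants

small = relative_work(10_000, 100_000)
large = relative_work(500_000, 100_000)
# 50x the samples at the same variant count means 2500x the work,
# which is why many cheap spot cores pay off here.
print(large // small)  # → 2500
```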


I have two more ideas about hl.index_bgen. First, are you using the s3a:// protocol? That protocol is generally much faster than the other S3 protocols. Second, can you share the “executor logs” for hl.index_bgen? These should be available in the Spark “Resource Manager” website. I think you can reach that page from the Spark “History Server”.
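On the first point, switching schemes is just a change of URI prefix before the path reaches Hail. A tiny hypothetical helper (not a Hail function) to illustrate:

```python
def to_s3a(uri: str) -> str:
    # Rewrite s3:// or s3n:// URIs to the generally faster s3a:// scheme;
    # leave other URIs (including ones already using s3a://) untouched.
    for scheme in ("s3://", "s3n://"):
        if uri.startswith(scheme):
            return "s3a://" + uri[len(scheme):]
    return uri

print(to_s3a("s3://bucket_name/raw/ukb22828_chr22_v3.bgen"))
# → s3a://bucket_name/raw/ukb22828_chr22_v3.bgen
```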

There must be some subtle issue. hl.index_bgen should be quite quick, no more than ten minutes for chromosome 22 alone.


[1] When using Hail, you generally do not need more than 10 core nodes and usually only need two. Hail is primarily designed to work with spot instance task nodes. “Spot instances” are often substantially cheaper than normal instances.


Oh wow, thank you for such detailed explanations, Dan! You rock :slight_smile:

I've now tried the s3a:// protocol, but there was no improvement. Oddly, no log file was generated on the Executor Logs pages; it seems the nodes are idle even though they are labeled as Active.

I also tried indexing it on a regular EC2 instance, without the EMR cluster, and the process took less than 5 minutes.

Here is the Executors tab in the Spark History Server, it may help.

Thank you again,
Rodrigo

Hmm. @rodrigo.barreiro , just to be abundantly clear: if you start a new EC2 instance, completely separate from EMR, install Hail on that instance, and run index_bgen on that instance, it completes in less than five minutes?

Mysterious… Let me give this some more thought.

Yes, Dan. Note that I downloaded the BGEN file from S3 to the EC2 disk first, though.

Oh, interesting! The issue might be between Hail and S3. This script should allow a computer to access S3 directly. Can you try running that on an EC2 instance (not part of EMR) and then running hl.index_bgen('s3a://...')? If that is slow, then the issue is definitely with the S3 Hadoop/Spark connector. If that’s the case, I might be able to replicate your issue myself.


Hello Dan,

Thank you again for the amazing support.

We were able to run the indexing using the S3 connector config that you sent us, and it took about 25 minutes. Then we tried again in the EMR notebook and, to our surprise, it worked as well. I am still not sure what happened, but my best guess is that the s3a:// protocol made it work (it is the only change we made to the code). In our previous test using s3a:// we waited only about 20 minutes before stopping it (since your test took 10 minutes, and ours with Hail alone, with the BGEN already downloaded, took less than 5 minutes).

Thank you, Dan!
