Stage contains a task of very large size

We have a very small VCF (less than 5 MB after VEP annotation). We tried to write it to a MatrixTable (MT) file after annotating it with some reference datasets, but the job took forever to finish. One run went on for more than 6 hours, which doesn’t sound right. Other VCFs, about 100 times bigger than this one, went through annotation and writing to an MT file without problems.
Our workflow step is very similar to the one below.

I reviewed the Hail log; the only suspicious lines are listed below. I am wondering what might cause a task to be too large? Thank you! We are still using Hail 0.2.57. I interrupted the job this time.

2022-04-01 03:29:38 DAGScheduler: INFO: Submitting 12 missing tasks from ResultStage 6 (MapPartitionsRDD[256] at mapPartitions at ContextRDD.scala:160) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11))
2022-04-01 03:29:38 YarnScheduler: INFO: Adding task set 6.0 with 12 tasks
2022-04-01 03:29:38 TaskSetManager: WARN: Stage 6 contains a task of very large size (5223 KB). The maximum recommended task size is 100 KB.
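
For a log of several megabytes, a small script can list every stage that triggered this warning. The following is a minimal sketch, assuming the warning text always matches the line quoted above; the function name `find_large_tasks` is made up for illustration.

```python
import re

# Pattern built from the TaskSetManager warning quoted above.
WARN_RE = re.compile(
    r"Stage (?P<stage>\d+) contains a task of very large size \((?P<kb>\d+) KB\)\. "
    r"The maximum recommended task size is (?P<limit>\d+) KB\."
)


def find_large_tasks(log_lines):
    """Return (stage, task_kb, recommended_kb) for every matching warning line."""
    hits = []
    for line in log_lines:
        match = WARN_RE.search(line)
        if match:
            hits.append((int(match["stage"]), int(match["kb"]), int(match["limit"])))
    return hits


sample_log = [
    "2022-04-01 03:29:38 YarnScheduler: INFO: Adding task set 6.0 with 12 tasks",
    "2022-04-01 03:29:38 TaskSetManager: WARN: Stage 6 contains a task of very large size "
    "(5223 KB). The maximum recommended task size is 100 KB.",
]

for stage, kb, limit in find_large_tasks(sample_log):
    print(f"stage {stage}: task is {kb} KB, about {kb / limit:.0f}x the recommended {limit} KB")
```

To scan a real log, pass an open file handle instead of `sample_log`; iterating over the handle yields one line at a time.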

hail-20220401-0324-0.2.57-582b2e31b8bd.txt (4.1 MB)

Hey @SimonLi5601 !

It’s a bit hard to comment without the exact code that y’all are executing. Is it possible to share that?

Also, how many variants did you start with? How many samples did you start with? Generally, things are quite a bit slower when starting with a VCF as opposed to a Hail MatrixTable. Seeing the exact code you ran will help us nail down the source of slowness.

I think the problem here is a lack of parallelism within your VCF. Joining a small left-side dataset against very large right-side datasets leads to very inefficient execution in Hail right now. You can fix this by importing your VCF with a bunch of partitions; the default (file size divided by ~32 MB) only gave you 12:

hl.import_vcf(path, min_partitions=1000)

@danking @tpoterba Thanks for your reply. When we call import_vcf, we do set min_partitions to 500. Somehow, after VEP annotation, the partition count is reset to 12. I used mt.repartition to bring it to 1000, but it didn’t help much. The stage slows down significantly, with about 30% of its tasks running slowly. It could be a combination of data and software issues. We are trying to upgrade Hail (and Hadoop, Spark, and Elasticsearch accordingly).
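
To see exactly which step changes the partition count, the count can be recorded after every step. The following is a minimal sketch, not part of the pipeline: `partition_trace` and the stand-in class `FakeMT` are hypothetical names, and the only Hail methods assumed are `n_partitions()` and `repartition()`, which MatrixTable provides. The stand-in lets the sketch run without a cluster; the numbers 500, 12, and 1000 are the ones reported in this thread.

```python
def partition_trace(mt, steps):
    """Apply each (label, fn) step in order, recording n_partitions() after each one."""
    trace = [("input", mt.n_partitions())]
    for label, fn in steps:
        mt = fn(mt)
        trace.append((label, mt.n_partitions()))
    return mt, trace


class FakeMT:
    """Hypothetical stand-in exposing only the two methods used above."""

    def __init__(self, n):
        self._n = n

    def n_partitions(self):
        return self._n

    def repartition(self, n):
        return FakeMT(n)


steps = [
    ("after VEP", lambda m: FakeMT(12)),            # mimics the reported drop to 12
    ("after repartition", lambda m: m.repartition(1000)),
]
_, trace = partition_trace(FakeMT(500), steps)
for label, n in trace:
    print(f"{label}: {n} partitions")
```

With a real MatrixTable, the lambdas would wrap the actual pipeline steps (for example the VEP call and the annotation join), so the trace shows the first step at which the count drops.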

Hi, @danking @tpoterba
I would like to follow up on the issue we had before. Unfortunately, we still have a long way to go before we can upgrade Hail, due to infrastructure limitations. We want to double-check whether we missed anything. We wondered whether the cause is that the VCF didn’t go through VQSR, but if I just import the VCF and export it as an MT file, it works. The problem only appears after we join the reference data, the step I highlighted in the code above; then it takes forever to finish. We have processed a lot of WGS data without hitting this problem, and this VCF contains fewer than 700 variants. Do you think there might be something special in the VEP annotation? Thanks!

I pasted two variants as example after VEP annotation in case that helps.

chr2	201209484	rs17860405	A	G	157158	.	AC=2;AF=0.067;AN=30;AS_BaseQRankSum=4.75;AS_FS=0;AS_InbreedingCoeff=-0.0714;AS_MQ=60;AS_MQRankSum=0.15;AS_QD=12.78;AS_ReadPosRankSum=0.85;AS_SOR=0.697;BaseQRankSum=5.2;DB;DP=32928;ExcessHet=3.1627;FS=0;InbreedingCoeff=-0.0714;MLEAC=2;MLEAF=0.067;MQ=60;MQRankSum=0.469;QD=12.78;ReadPosRankSum=1.47;SOR=0.689;CSQ=G|missense_variant|MODERATE|CASP10|ENSG00000003400|Transcript|ENST00000272879|protein_coding|9/10||ENST00000272879.9:c.1337A>G|ENSP00000272879.5:p.Tyr446Cys|1521|1337|446|Y/C|tAt/tGt|rs17860405&CM060890|1||1||SNV|1|HGNC|HGNC:1500||||2||CCDS2338.1|ENSP00000272879|Q92851.219||UPI000004466C|Q92851-1|1|tolerated(0.09)|possibly_damaging(0.677)|Gene3D:3.40.50.1460&Pfam:PF00656&PROSITE_profiles:PS50207&PANTHER:PTHR10454&PANTHER:PTHR10454:SF26&SMART:SM00115&Superfamily:SSF52129&CDD:cd00032|||0.0128|0.0008|0.0231|0|0.0417|0.0051|0.007036|0.03744|0.02961|0.006645|0.0189|0.01234|0.0001088|0.06514|0.04129|0.03525|0.00778|0.06514|gnomAD_FIN|benign||1&1|20301287&22056502&16446975&31249631|||||||||,G|missense_variant|MODERATE|CASP10|ENSG00000003400|Transcript|ENST00000286186|protein_coding|9/10||ENST00000286186.11:c.1337A>G|ENSP00000286186.6:p.Tyr446Cys|1512|1337|446|Y/C|tAt/tGt|rs17860405&CM060890|1||1||SNV|1|HGNC|HGNC:1500|YES|NM_032977.4||1|P2|CCDS2340.1|ENSP00000286186|Q92851.219|A0A0S2Z3Z5.30|UPI0000074732|Q92851-4|1|tolerated(0.14)|benign(0.414)|Gene3D:3.40.50.1460&Pfam:PF00656&PROSITE_profiles:PS50207&PANTHER:PTHR10454&PANTHER:PTHR10454:SF26&SMART:SM00115&Superfamily:SSF52129&CDD:cd00032|||0.0128|0.0008|0.0231|0|0.0417|0.0051|0.007036|0.03744|0.02961|0.006645|0.0189|0.01234|0.0001088|0.06514|0.04129|0.03525|0.00778|0.06514|gnomAD_FIN|benign||1&1|20301287&22056502&16446975&31249631|||||||||,G|missense_variant|MODERATE|CASP10|ENSG00000003400|Transcript|ENST00000313728|protein_coding|7/8||ENST00000313728.11:c.1136A>G|ENSP00000314599.7:p.Tyr379Cys|1260|1136|379|Y/C|tAt/tGt|rs17860405&CM060890|1||1||SNV|1|HGNC|HGNC:1500||||1||CCDS56159
.1|ENSP00000314599|Q92851.219||UPI0000421EE8|Q92851-6|1|tolerated(0.13)|possibly_damaging(0.482)|Gene3D:3.40.50.1460&Pfam:PF00656&PROSITE_profiles:PS50207&PANTHER:PTHR10454&PANTHER:PTHR10454:SF26&SMART:SM00115&Superfamily:SSF52129&CDD:cd00032|||0.0128|0.0008|0.0231|0|0.0417|0.0051|0.007036|0.03744|0.02961|0.006645|0.0189|0.01234|0.0001088|0.06514|0.04129|0.03525|0.00778|0.06514|gnomAD_FIN|benign||1&1|20301287&22056502&16446975&31249631|||||||||,G|missense_variant|MODERATE|CASP10|ENSG00000003400|Transcript|ENST00000346817|protein_coding|7/8||ENST00000346817.9:c.1208A>G|ENSP00000237865.7:p.Tyr403Cys|1355|1208|403|Y/C|tAt/tGt|rs17860405&CM060890|1||1||SNV|1|HGNC|HGNC:1500||||5|A2|CCDS2339.1|ENSP00000237865|Q92851.219|A0A0S2Z3G5.39|UPI000013CA28|Q92851-2|1|tolerated(0.16)|benign(0.411)|Gene3D:3.40.50.1460&Pfam:PF00656&PROSITE_profiles:PS50207&PANTHER:PTHR10454&PANTHER:PTHR10454:SF26&SMART:SM00115&Superfamily:SSF52129&CDD:cd00032|||0.0128|0.0008|0.0231|0|0.0417|0.0051|0.007036|0.03744|0.02961|0.006645|0.0189|0.01234|0.0001088|0.06514|0.04129|0.03525|0.00778|0.06514|gnomAD_FIN|benign||1&1|20301287&22056502&16446975&31249631|||||||||,G|3_prime_UTR_variant|MODIFIER|CASP10|ENSG00000003400|Transcript|ENST00000360132|protein_coding|8/9||ENST00000360132.7:c.*423A>G||1663|||||rs17860405&CM060890|1||1||SNV|1|HGNC|HGNC:1500||||5|||ENSP00000353250|Q92851.219||UPI000002ABA4|Q92851-3|1||||||0.0128|0.0008|0.0231|0|0.0417|0.0051|0.007036|0.03744|0.02961|0.006645|0.0189|0.01234|0.0001088|0.06514|0.04129|0.03525|0.00778|0.06514|gnomAD_FIN|benign||1&1|20301287&22056502&16446975&31249631|||||||||,G|downstream_gene_variant|MODIFIER|MTND5P25|ENSG00000227348|Transcript|ENST00000430499|unprocessed_pseudogene||||||||||rs17860405&CM060890|1|2799|-1||SNV|1|HGNC|HGNC:42287|YES|||||||||||||||||0.0128|0.0008|0.0231|0|0.0417|0.0051|0.007036|0.03744|0.02961|0.006645|0.0189|0.01234|0.0001088|0.06514|0.04129|0.03525|0.00778|0.06514|gnomAD_FIN|benign||1&1|20301287&22056502&16446975&31249631|||||||||,G|do
wnstream_gene_variant|MODIFIER|CASP10|ENSG00000003400|Transcript|ENST00000438843|nonsense_mediated_decay||||||||||rs17860405&CM060890|1|1324|1||SNV|1|HGNC|HGNC:1500||||2|||ENSP00000401914||B4E3T5.91|UPI0000E07CFD||1||||||0.0128|0.0008|0.0231|0|0.0417|0.0051|0.007036|0.03744|0.02961|0.006645|0.0189|0.01234|0.0001088|0.06514|0.04129|0.03525|0.00778|0.06514|gnomAD_FIN|benign||1&1|20301287&22056502&16446975&31249631|||||||||,G|downstream_gene_variant|MODIFIER|MTND4P23|ENSG00000225796|Transcript|ENST00000447723|unprocessed_pseudogene||||||||||rs17860405&CM060890|1|3814|-1||SNV|1|HGNC|HGNC:42210|YES|||||||||||||||||0.0128|0.0008|0.0231|0|0.0417|0.0051|0.007036|0.03744|0.02961|0.006645|0.0189|0.01234|0.0001088|0.06514|0.04129|0.03525|0.00778|0.06514|gnomAD_FIN|benign||1&1|20301287&22056502&16446975&31249631|||||||||,G|missense_variant|MODERATE|CASP10|ENSG00000003400|Transcript|ENST00000448480|protein_coding|7/8||ENST00000448480.1:c.1208A>G|ENSP00000396835.1:p.Tyr403Cys|1329|1208|403|Y/C|tAt/tGt|rs17860405&CM060890|1||1||SNV|1|HGNC|HGNC:1500||||1||CCDS56160.1|ENSP00000396835|Q92851.219||UPI0000367D6F|Q92851-5|1|tolerated(0.14)|benign(0.231)|Gene3D:3.40.50.1460&Pfam:PF00656&PROSITE_profiles:PS50207&PANTHER:PTHR10454&PANTHER:PTHR10454:SF26&SMART:SM00115&Superfamily:SSF52129&CDD:cd00032|||0.0128|0.0008|0.0231|0|0.0417|0.0051|0.007036|0.03744|0.02961|0.006645|0.0189|0.01234|0.0001088|0.06514|0.04129|0.03525|0.00778|0.06514|gnomAD_FIN|benign||1&1|20301287&22056502&16446975&31249631|||||||||,G|downstream_gene_variant|MODIFIER|CASP10|ENSG00000003400|Transcript|ENST00000460140|retained_intron||||||||||rs17860405&CM060890|1|3234|1||SNV|1|HGNC|HGNC:1500||||1||||||||1||||||0.0128|0.0008|0.0231|0|0.0417|0.0051|0.007036|0.03744|0.02961|0.006645|0.0189|0.01234|0.0001088|0.06514|0.04129|0.03525|0.00778|0.06514|gnomAD_FIN|benign||1&1|20301287&22056502&16446975&31249631|||||||||,G|non_coding_transcript_exon_variant|MODIFIER|CASP10|ENSG00000003400|Transcript|ENST00000492363|processed_transcr
ipt|7/8||ENST00000492363.5:n.1245A>G||1245|||||rs17860405&CM060890|1||1||SNV|1|HGNC|HGNC:1500||||2||||||||1||||||0.0128|0.0008|0.0231|0|0.0417|0.0051|0.007036|0.03744|0.02961|0.006645|0.0189|0.01234|0.0001088|0.06514|0.04129|0.03525|0.00778|0.06514|gnomAD_FIN|benign||1&1|20301287&22056502&16446975&31249631|||||||||,G|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00001043178|promoter||||||||||rs17860405&CM060890|1||||SNV|1||||||||||||||||||||0.0128|0.0008|0.0231|0|0.0417|0.0051|0.007036|0.03744|0.02961|0.006645|0.0189|0.01234|0.0001088|0.06514|0.04129|0.03525|0.00778|0.06514|gnomAD_FIN|benign||1&1|20301287&22056502&16446975&31249631|||||||||	GT:AD:DP:GQ:PL	0/0:1749,0:1749:99:0,120,1800	0/0:1442,0:1442:99:0,120,1800	0/1:3017,2917:5944:99:71491,0,70354	0/0:1503,0:1503:99:0,120,1800	0/0:1396,0:1396:99:0,120,1800	0/0:1699,0:1699:99:0,120,1800	0/0:1420,0:1420:99:0,120,1800	0/1:3209,3150:6389:99:85687,0,83797	0/0:2219,0:2219:99:0,120,1800	0/0:1296,0:1296:99:0,120,1800	0/0:1645,0:1645:99:0,120,1800	0/0:1534,0:1534:99:0,120,1800	0/0:1289,0:1289:99:0,120,1800	0/0:1483,0:1483:99:0,120,1800	0/0:1526,0:1526:99:0,120,1800
chr4	112430666	rs61747381	G	A	181587	.	AC=2;AF=0.067;AN=30;AS_BaseQRankSum=.;AS_FS=0;AS_InbreedingCoeff=1;AS_MQ=60;AS_MQRankSum=.;AS_QD=30.77;AS_ReadPosRankSum=.;AS_SOR=0.7;DB;DP=33574;ExcessHet=0.0755;FS=0;InbreedingCoeff=1;MLEAC=2;MLEAF=0.067;MQ=60;QD=30.77;SOR=0.699;CSQ=A|synonymous_variant|LOW|ALPK1|ENSG00000073331|Transcript|ENST00000177648|protein_coding|11/16||ENST00000177648.13:c.1119G>A|ENSP00000177648.9:p.Gly373%3D|1319|1119|373|G|ggG/ggA|rs61747381|1||1||SNV|1|HGNC|HGNC:20917||||1|P2|CCDS3697.1|ENSP00000177648|Q96QP1.147||UPI000045725F|Q96QP1-1|1|||PDB-ENSP_mappings:5z2c.A&PDB-ENSP_mappings:5z2c.B&PDB-ENSP_mappings:5z2c.C&PDB-ENSP_mappings:5z2c.D&PDB-ENSP_mappings:5z2c.E&PDB-ENSP_mappings:5z2c.F&PDB-ENSP_mappings:5z2c.G&PDB-ENSP_mappings:5z2c.H&PDB-ENSP_mappings:5z2c.I&PANTHER:PTHR46747|||0.0383|0.0023|0.1081|0.001|0.0447|0.0695|0.01158|0.05547|0.06771|0.01003|0.1899|0.0278|0.0008157|0.03022|0.05631|0.06042|0.08382|0.1899|gnomAD_AMR|||||||||||||,A|synonymous_variant|LOW|ALPK1|ENSG00000073331|Transcript|ENST00000458497|protein_coding|12/17||ENST00000458497.6:c.1119G>A|ENSP00000398048.1:p.Gly373%3D|1501|1119|373|G|ggG/ggA|rs61747381|1||1||SNV|1|HGNC|HGNC:20917||||5|P2|CCDS3697.1|ENSP00000398048|Q96QP1.147||UPI000045725F|Q96QP1-1|1|||PDB-ENSP_mappings:5z2c.A&PDB-ENSP_mappings:5z2c.B&PDB-ENSP_mappings:5z2c.C&PDB-ENSP_mappings:5z2c.D&PDB-ENSP_mappings:5z2c.E&PDB-ENSP_mappings:5z2c.F&PDB-ENSP_mappings:5z2c.G&PDB-ENSP_mappings:5z2c.H&PDB-ENSP_mappings:5z2c.I&PANTHER:PTHR46747|||0.0383|0.0023|0.1081|0.001|0.0447|0.0695|0.01158|0.05547|0.06771|0.01003|0.1899|0.0278|0.0008157|0.03022|0.05631|0.06042|0.08382|0.1899|gnomAD_AMR|||||||||||||,A|synonymous_variant|LOW|ALPK1|ENSG00000073331|Transcript|ENST00000504176|protein_coding|10/15||ENST00000504176.6:c.885G>A|ENSP00000426044.2:p.Gly295%3D|1191|885|295|G|ggG/ggA|rs61747381|1||1||SNV|1|HGNC|HGNC:20917||||2|A2|CCDS58923.1|ENSP00000426044|Q96QP1.147||UPI00020657A2|Q96QP1-2|1|||PANTHER:PTHR46747|||0.0383|0.0023|0.1081|0.
001|0.0447|0.0695|0.01158|0.05547|0.06771|0.01003|0.1899|0.0278|0.0008157|0.03022|0.05631|0.06042|0.08382|0.1899|gnomAD_AMR|||||||||||||,A|non_coding_transcript_exon_variant|MODIFIER|ALPK1|ENSG00000073331|Transcript|ENST00000504745|retained_intron|7/12||ENST00000504745.1:n.1607G>A||1607|||||rs61747381|1||1||SNV|1|HGNC|HGNC:20917||||2||||||||1||||||0.0383|0.0023|0.1081|0.001|0.0447|0.0695|0.01158|0.05547|0.06771|0.01003|0.1899|0.0278|0.0008157|0.03022|0.05631|0.06042|0.08382|0.1899|gnomAD_AMR|||||||||||||,A|intron_variant&NMD_transcript_variant|MODIFIER|ALPK1|ENSG00000073331|Transcript|ENST00000505127|nonsense_mediated_decay||10/14|ENST00000505127.5:c.900+1413G>A|||||||rs61747381|1||1||SNV|1|HGNC|HGNC:20917||||2|||ENSP00000425559||B3KUH8.69|UPI00003E6011||1||||||0.0383|0.0023|0.1081|0.001|0.0447|0.0695|0.01158|0.05547|0.06771|0.01003|0.1899|0.0278|0.0008157|0.03022|0.05631|0.06042|0.08382|0.1899|gnomAD_AMR|||||||||||||,A|downstream_gene_variant|MODIFIER|ALPK1|ENSG00000073331|Transcript|ENST00000508589|processed_transcript||||||||||rs61747381|1|3017|1||SNV|1|HGNC|HGNC:20917||||3||||||||1||||||0.0383|0.0023|0.1081|0.001|0.0447|0.0695|0.01158|0.05547|0.06771|0.01003|0.1899|0.0278|0.0008157|0.03022|0.05631|0.06042|0.08382|0.1899|gnomAD_AMR|||||||||||||,A|downstream_gene_variant|MODIFIER|ALPK1|ENSG00000073331|Transcript|ENST00000509209|retained_intron||||||||||rs61747381|1|4703|1||SNV|1|HGNC|HGNC:20917||||2||||||||1||||||0.0383|0.0023|0.1081|0.001|0.0447|0.0695|0.01158|0.05547|0.06771|0.01003|0.1899|0.0278|0.0008157|0.03022|0.05631|0.06042|0.08382|0.1899|gnomAD_AMR|||||||||||||,A|3_prime_UTR_variant&NMD_transcript_variant|MODIFIER|ALPK1|ENSG00000073331|Transcript|ENST00000509722|nonsense_mediated_decay|10/15||ENST00000509722.5:c.*562G>A||1154|||||rs61747381|1||1||SNV|1|HGNC|HGNC:20917||||2|||ENSP00000424492||D6RB29.49|UPI0001D3B73A||1||||||0.0383|0.0023|0.1081|0.001|0.0447|0.0695|0.01158|0.05547|0.06771|0.01003|0.1899|0.0278|0.0008157|0.03022|0.05631|0.06042|0.08382|0.189
9|gnomAD_AMR|||||||||||||,A|downstream_gene_variant|MODIFIER|ALPK1|ENSG00000073331|Transcript|ENST00000512847|retained_intron||||||||||rs61747381|1|2555|1||SNV|1|HGNC|HGNC:20917||||3||||||||1||||||0.0383|0.0023|0.1081|0.001|0.0447|0.0695|0.01158|0.05547|0.06771|0.01003|0.1899|0.0278|0.0008157|0.03022|0.05631|0.06042|0.08382|0.1899|gnomAD_AMR|||||||||||||,A|downstream_gene_variant|MODIFIER|ALPK1|ENSG00000073331|Transcript|ENST00000515330|nonsense_mediated_decay||||||||||rs61747381|1|4973|1||SNV|1|HGNC|HGNC:20917||||2|||ENSP00000423978||B4E0R2.73|UPI00017A8368||1||||||0.0383|0.0023|0.1081|0.001|0.0447|0.0695|0.01158|0.05547|0.06771|0.01003|0.1899|0.0278|0.0008157|0.03022|0.05631|0.06042|0.08382|0.1899|gnomAD_AMR|||||||||||||,A|synonymous_variant|LOW|ALPK1|ENSG00000073331|Transcript|ENST00000650871|protein_coding|11/16||ENST00000650871.1:c.1119G>A|ENSP00000498374.1:p.Gly373%3D|1372|1119|373|G|ggG/ggA|rs61747381|1||1||SNV|1|HGNC|HGNC:20917|YES|NM_025144.4|||P2|CCDS3697.1|ENSP00000498374|Q96QP1.147||UPI000045725F|Q96QP1-1|1|||PDB-ENSP_mappings:5z2c.A&PDB-ENSP_mappings:5z2c.B&PDB-ENSP_mappings:5z2c.C&PDB-ENSP_mappings:5z2c.D&PDB-ENSP_mappings:5z2c.E&PDB-ENSP_mappings:5z2c.F&PDB-ENSP_mappings:5z2c.G&PDB-ENSP_mappings:5z2c.H&PDB-ENSP_mappings:5z2c.I&PANTHER:PTHR46747|||0.0383|0.0023|0.1081|0.001|0.0447|0.0695|0.01158|0.05547|0.06771|0.01003|0.1899|0.0278|0.0008157|0.03022|0.05631|0.06042|0.08382|0.1899|gnomAD_AMR|||||||||||||	GT:AD:DP:GQ:PL	0/0:2080,0:2080:99:0,120,1800	0/0:1878,0:1878:99:0,120,1800	0/0:1753,0:1753:99:0,120,1800	1/1:0,5901:5920:99:181613,17695,0	0/0:1617,0:1617:99:0,120,1800	0/0:2007,0:2007:99:0,120,1800	0/0:1602,0:1602:99:0,120,1800	0/0:2344,0:2344:99:0,120,1800	0/0:3030,0:3030:99:0,120,1800	0/0:1765,0:1765:99:0,120,1800	0/0:1938,0:1938:99:0,120,1800	0/0:1952,0:1952:99:0,120,1800	0/0:1676,0:1676:99:0,120,1800	0/0:1925,0:1925:99:0,120,1800	0/0:1958,0:1958:99:0,120,1800
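
One quick way to check whether the VEP annotation is unusually heavy is to count the CSQ records per variant directly from the VCF text. The following is a minimal sketch, assuming the usual VEP layout seen in the two lines above (CSQ records separated by commas, fields within a record separated by `|`); the function name `csq_summary` and the short sample line are made up for illustration.

```python
def csq_summary(vcf_line):
    """Return (number of CSQ records, largest field count in any record) for one VCF data line."""
    columns = vcf_line.rstrip("\n").split("\t")
    info = columns[7]  # INFO is the 8th column of a VCF data line
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            records = entry[len("CSQ="):].split(",")
            return len(records), max(len(record.split("|")) for record in records)
    return 0, 0


# Synthetic, heavily shortened line; real CSQ records carry many more fields.
sample_line = (
    "chr1\t100\t.\tA\tG\t50\t.\t"
    "AC=2;CSQ=G|missense_variant|MODERATE,G|intron_variant|MODIFIER|X\t"
    "GT\t0/1"
)
n_records, max_fields = csq_summary(sample_line)
print(f"{n_records} CSQ records, up to {max_fields} fields each")
```

Running this over every data line of the small VCF and of one of the larger VCFs that loaded fine would show whether the per-variant annotation size differs much between them.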

Hey @SimonLi5601 !

It’s very hard to comment on what the issue is without the source code you’re running. Can you share part of the code?

Hi @danking, yes, sure. As I mentioned in the first post, our code is very similar to the code I highlighted above. I have posted our version below as well in case it helps. We are literally running the same steps, except that we removed the split_multi_hts step and decided to retain aIndex. Since we assume that normalization of the VCF and filtering out of *-allele variants have already been done, aIndex is just set to 1 for all variants.

def read_vcf_write_mt(self, schema_cls=SeqrVariantsAndGenotypesSchema):
        mt = self.import_vcf()
        mt = mt.annotate_rows(a_index=1)
        mt = mt.annotate_rows(alleles_old=mt.alleles, locus_old=mt.locus)
        mt = mt.key_rows_by(locus=hl.min_rep(mt.locus, mt.alleles)[0], alleles=hl.min_rep(mt.locus, mt.alleles)[1])
        if self.validate:
            self.validate_mt(mt, self.genome_version, self.sample_type)
        if self.remap_path:
            mt = self.remap_sample_ids(mt, self.remap_path)
        if self.subset_path:
            mt = self.subset_samples_and_variants(mt, self.subset_path)
        mt = HailMatrixTableTask.run_vep(mt, self.genome_version, self.vep_runner)
        ref_data = hl.read_table(self.reference_ht_path)
        clinvar = hl.read_table(self.clinvar_ht_path)
        mt = schema_cls(mt, ref_data=ref_data, clinvar_data=clinvar, hgmd_like_data=hgmd_like, hgmd_data=hgmd,
                        cidr_data=cidr, nisc_data=nisc, bgi_data=bgi, hgsc_wes_data=hgsc_wes, hgsc_wgs_data=hgsc_wgs).annotate_all(
            overwrite=True).select_annotated_mt()
        mt = mt.annotate_globals(sourceFilePath=','.join(self.source_paths),
                                 genomeVersion=self.genome_version,
                                 sampleType=self.sample_type,
                                 hail_version=pkg_resources.get_distribution('hail').version)
        mt.describe()
        mt.write(self.output().path, overwrite=True)

Changing the key of the matrix table introduces a shuffle, which will probably change the partitioning. Are you sure you need to do that? The latest version of the SEQR loading pipeline doesn’t do that.
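
Whether the re-keying matters can be checked outside Hail: if every variant is already in minimal representation, re-keying by the min_rep result would not change any key. The following is a plain-Python sketch of the common trimming rule (drop shared trailing bases, then shared leading bases while shifting the position). It is an illustration only, not Hail's implementation of hl.min_rep, it does not handle symbolic or * alleles, and the function names are made up.

```python
def minimal_representation(pos, alleles):
    """Trim bases shared by all alleles; return the adjusted (pos, alleles)."""
    alleles = list(alleles)
    # Drop shared trailing bases while every allele keeps at least one base.
    while min(len(a) for a in alleles) > 1 and len({a[-1] for a in alleles}) == 1:
        alleles = [a[:-1] for a in alleles]
    # Drop shared leading bases, moving the position one base right each time.
    while min(len(a) for a in alleles) > 1 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


def needs_rekey(pos, alleles):
    """True if trimming would change the variant's position or alleles."""
    return minimal_representation(pos, alleles) != (pos, list(alleles))


# The SNV posted earlier in the thread is already minimal.
print(needs_rekey(201209484, ["A", "G"]))
# A padded representation of a SNV would be changed by trimming.
print(minimal_representation(100, ["GCAT", "GTAT"]))
```

If `needs_rekey` is false for every variant in the input, dropping the key change should leave the data unchanged.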

I commented out this step. I pasted the current full function.

def read_vcf_write_mt(self, schema_cls=SeqrVariantsAndGenotypesSchema):
        mt = self.import_vcf()

        # We removed split_multi_hts but decided to retain aIndex. Since we assume that normalization of the VCF and 
        # filtering out * - allele variants was already done, aIndex is just assigned to 1 for all variants
        mt = mt.annotate_rows(a_index=1)
        mt = mt.annotate_rows(alleles_old=mt.alleles, locus_old=mt.locus)
        #mt = mt.key_rows_by(locus=hl.min_rep(mt.locus, mt.alleles)[0], alleles=hl.min_rep(mt.locus, mt.alleles)[1])

        if self.validate:
            self.validate_mt(mt, self.genome_version, self.sample_type)
        if self.remap_path:
            mt = self.remap_sample_ids(mt, self.remap_path)
        if self.subset_path:
            mt = self.subset_samples_and_variants(mt, self.subset_path)
        mt = HailMatrixTableTask.run_vep(mt, self.genome_version, self.vep_runner)
        mt = mt.repartition(1000)
        ref_data = hl.read_table(self.reference_ht_path)
        clinvar = hl.read_table(self.clinvar_ht_path)

        # hgmd_like is optional
        #hgmd_like = get_hgmd_like_data(self.hgmd_like_csv_path, self.genome_version) if self.hgmd_like_csv_path else None
        # hgmd is optional.
        #hgmd = hl.read_table(self.hgmd_ht_path) if self.hgmd_ht_path else None
        # cidr, nisc, bgi, hgsc_wes, hgsc_wgs are optional
        #cidr = hl.read_table(self.cidr_ht_path) if self.cidr_ht_path else None
        #nisc = hl.read_table(self.nisc_ht_path) if self.nisc_ht_path else None
        #bgi = hl.read_table(self.bgi_ht_path) if self.bgi_ht_path else None
        #hgsc_wes = hl.read_table(self.hgsc_wes_ht_path) if self.hgsc_wes_ht_path else None
        #hgsc_wgs = hl.read_table(self.hgsc_wgs_ht_path) if self.hgsc_wgs_ht_path else None
        #ref_data= None
        hgmd_like = None
        hgmd = None
        cidr = None
        nisc = None
        bgi = None
        hgsc_wes = None
        hgsc_wgs = None

        mt = schema_cls(mt, ref_data=ref_data, clinvar_data=clinvar, hgmd_like_data=hgmd_like, hgmd_data=hgmd,
                        cidr_data=cidr, nisc_data=nisc, bgi_data=bgi, hgsc_wes_data=hgsc_wes, hgsc_wgs_data=hgsc_wgs).annotate_all(
            overwrite=True).select_annotated_mt()

        mt = mt.annotate_globals(sourceFilePath=','.join(self.source_paths),
                                 genomeVersion=self.genome_version,
                                 sampleType=self.sample_type,
                                 hail_version=pkg_resources.get_distribution('hail').version)

        mt.describe()
        #mt.write(self.output().path, stage_locally=True, overwrite=True)
        mt.write(self.output().path, overwrite=True)

And it has been stuck in the last steps of Stage 5, shown below, for more than half an hour. This cluster has 4 powerful worker nodes. It’s very strange.

[Stage 5:===========================================> (651 + 141) / 792]
[Stage 5:============================================> (661 + 131) / 792]
[Stage 5:============================================> (670 + 122) / 792]
[Stage 5:=============================================> (675 + 117) / 792]
[Stage 5:=============================================> (686 + 106) / 792]
[Stage 5:===============================================> (695 + 97) / 792]
[Stage 5:===============================================> (700 + 92) / 792]
[Stage 5:================================================> (705 + 87) / 792]
[Stage 5:================================================> (707 + 85) / 792]
[Stage 5:================================================> (709 + 83) / 792]
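
Spark's console progress bar reads as (completed + running) / total tasks. The following is a minimal sketch that parses such lines and reports how many tasks are still queued; the name `parse_progress` is made up for illustration. For the lines above, completed plus running always equals 792, so no tasks are waiting for a slot and the stage is held up only by tasks that are already running.

```python
import re

# Matches lines such as "[Stage 5:=====> (651 + 141) / 792]".
PROGRESS_RE = re.compile(r"\[Stage (\d+):[=> ]*\((\d+) \+ (\d+)\) / (\d+)\]")


def parse_progress(line):
    """Return stage number and task counts from one progress line, or None if it does not match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    stage, done, running, total = map(int, match.groups())
    return {
        "stage": stage,
        "done": done,
        "running": running,
        "total": total,
        "queued": total - done - running,
    }


first = parse_progress("[Stage 5:===========================================> (651 + 141) / 792]")
last = parse_progress("[Stage 5:================================================> (709 + 83) / 792]")
print(first["queued"], last["queued"])
print(f"tasks finished between the two lines: {last['done'] - first['done']}")
```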

The same stage ran for more than 8 hours; three variants had been processing for at least a couple of hours, with no error. I decided to terminate the job, and I have attached the Hail log and the Spark (EMR) stdout log in case they are helpful.

hail-20220531-1634-0.2.57-582b2e31b8bd.txt (5.1 MB)
emr_stdout.txt (105.6 KB)