Dear Hail Team,
First, thanks for this awesome software! I’m using Hail for UKB analyses and am running into performance issues after sample/variant filtering of a MatrixTable.
Code to read/subset the UKB BGENs:
hl_imaging = hl_df.filter(~hl.is_nan(hl_df.age_at_scan))
ht_variants = hl.read_table('/gpfs/milgram/data/UKB/ukb_snp/tmp_variants.ht')
mt = hl.import_bgen(path=bgen_files,
                    entry_fields=['GT'],
                    sample_file=sample_files[0],
                    index_file_map=index_files)
mt = mt.filter_cols(hl.is_defined(hl_imaging[mt.s]))
mt = mt.filter_rows(hl.is_defined(ht_variants[mt.locus, mt.alleles]))
At this point, if I execute any method such as mt.count_rows() or mt.write(), Hail still runs with the original number of partitions for the BGEN data (~18k), e.g.:
In [24]: mt.count_rows()
2018-12-18 18:04:09 Hail: INFO: Coerced sorted dataset
2018-12-18 18:04:19 Hail: INFO: Coerced sorted dataset
2018-12-18 18:04:24 Hail: INFO: Coerced sorted dataset
[Stage 38:> (0 + 20) / 18257]
I’m only on a 20-core machine, but execution seems very slow. If I understand correctly, running:
mt = mt.repartition(500, shuffle=False)
should handle this by reducing the number of tasks each core has to work through, and so decrease execution time? The catch is that running .repartition() doesn’t seem to change the number of tasks being executed (it remains at 18,257). Is there another method that should be called after subsetting a MatrixTable?
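To make the arithmetic behind my expectation concrete, here is a rough sketch in plain Python (no Hail involved). The helper name `rounds_of_tasks` is mine, and it assumes one task per partition and that each core runs one task at a time, which is a simplification. It only counts scheduling rounds and says nothing about how long each task takes.

```python
import math


def rounds_of_tasks(n_partitions: int, n_cores: int) -> int:
    """Number of scheduling rounds if every core runs one task at a time.

    Hypothetical helper for illustration only; assumes one task per partition.
    """
    return math.ceil(n_partitions / n_cores)


n_cores = 20                 # cores on my machine
original_partitions = 18257  # task count shown in the progress bar
target_partitions = 500      # value I passed to mt.repartition()

# Rounds needed with the original BGEN partitioning
print(rounds_of_tasks(original_partitions, n_cores))

# Rounds needed if the repartition to 500 had taken effect
print(rounds_of_tasks(target_partitions, n_cores))
```

So I was expecting the stage to drop from roughly 900 rounds of tasks to 25, but the progress bar still shows 18,257 tasks.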
I’m new to the Hail environment, so this is probably a dumb mistake on my part. But any help would be hugely appreciated!
Thank you,
Kevin