Hail Tables and MatrixTables are recipes: each method you call adds steps to the recipe. "Action" methods like `count`, `write`, and `collect` actually execute the recipe. Your `count` is executing all the steps in your recipe. `MatrixTable.make_table` is a fairly inefficient method: it takes the compact vector representation of a Hail MatrixTable and converts it into a tabular form. What are you trying to achieve with `make_table`?
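The lazy "recipe" model can be sketched in plain Python. This is an illustrative toy (a hypothetical `Recipe` class, not Hail's implementation): transformation methods only record steps, and an action method runs them all at once.

```python
# Toy sketch of a lazy pipeline (illustrative only; not Hail's actual code).
# Transformations queue steps; an "action" like count() executes them.
class Recipe:
    def __init__(self, data, steps=None):
        self.data = data
        self.steps = steps or []

    def filter(self, predicate):
        # Lazy: record the step, do no work yet.
        step = lambda rows: [r for r in rows if predicate(r)]
        return Recipe(self.data, self.steps + [step])

    def count(self):
        # Action: execute every queued step, then count.
        rows = self.data
        for step in self.steps:
            rows = step(rows)
        return len(rows)

recipe = Recipe([1, 2, 3, 4, 5]).filter(lambda x: x > 2)
# Nothing has been filtered yet; count() runs the whole recipe.
print(recipe.count())  # → 3
```

This is why calling `count` midway through a script can be surprisingly slow: it pays for every step queued so far.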
What do `read_mt` and `filter_mt` do? The fastest possible filtering happens when you read from the Hail MatrixTable format (not a VCF) and when you use a `filter_rows` command that performs equalities or inequalities on `mt.locus`. This is generally faster than joining against a table of locus intervals. We should teach Hail to be smarter, but until then, for small tables of intervals (say, fewer than 100), you should do this instead:
```python
intervals = interval_table.interval.collect()
mt = mt.key_rows_by('locus')
mt = mt.filter_rows(
    hl.literal(intervals).any(lambda x: x.contains(mt.locus))
)
```
Where does `gnomAD_ht` come from?
When you write `~hl.is_nan(...)`, are you trying to check for missing values in the gnomad_AF array? NaN (not a number) and missing values are distinct in Hail. To keep only rows whose values are present, use `~hl.is_missing(...)` (equivalently, `hl.is_defined(...)`). `NaN` only appears when you divide zero by zero or perform similarly invalid floating-point operations.
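The same distinction exists in plain Python, where `None` (missing) and `float('nan')` are entirely different things; this analogy is just for intuition, not Hail code:

```python
import math

# NaN is a real floating-point value produced by invalid operations;
# a missing value (None here, a missing expression in Hail) is the
# absence of any value at all.
nan = float("nan")

print(math.isnan(nan))  # True: NaN is a float, just "not a number"
print(nan is None)      # False: NaN is not missing
print(nan == nan)       # False: NaN compares unequal even to itself
```

Because NaN compares unequal to everything, a filter like `x != float('nan')` never does what you want either; you must use an explicit NaN or missingness check.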
When you write

```python
filtered_mt = filtered_mt.filter_entries(
    (filtered_mt.GT != hl.Call([0, 0], phased=True))
    & (filtered_mt.GT != hl.Call([0, 0]))
)
```

it sounds like you're trying to keep non-hom-ref genotypes, is that right? Use `is_hom_ref` on the call expression:

```python
filtered_mt = filtered_mt.filter_entries(~filtered_mt.GT.is_hom_ref())
```
Hail is designed to run efficiently on all inputs, massively in parallel. If you have a list of chromosome-chunked VCF files, instead of looping over them, import them all together like this:

```python
vcf_paths = [os.path.join('PATH', f) for f in os.listdir('PATH')]
vcfs = hl.import_vcf(vcf_paths)
vcfs.write('.../dataset.mt')
```
Are you trying to count the total number of non-hom-ref genotypes passing your filter conditions? Try this instead:

```python
mt = hl.read_matrix_table(...)

# filter to regions of interest
intervals = interval_table.interval.collect()
mt = mt.key_rows_by('locus')
mt = mt.filter_rows(
    hl.literal(intervals).any(lambda x: x.contains(mt.locus))
)

# filter to rare gnomAD variants
mt = mt.annotate_rows(g=gnomAD_ht[mt.locus, mt.alleles])
mt = mt.filter_rows(hl.all(
    mt.g.AF[0] < 0.001,
    ~hl.is_missing(mt.g.AF[0])
))

# count non-hom-ref entries matching our filter criteria
result = mt.aggregate_entries(
    hl.agg.count_where(hl.all(
        mt.DP >= 20,
        ~mt.GT.is_hom_ref()
    ))
)
```