I’m testing on a small (1.2 GB) dataset and I would like to filter variants from a VCF file that overlap many short intervals. When I filter by a single interval, counting the result takes over 12 s.
This seems much slower than I would expect based on the results in this blog post (16 GB on a laptop in 300 ms). I’ve got 40 cores and 256 GB of memory, and I do see high CPU and memory use. How is my use different from the example? Why is it taking so long when the log says it is only accessing 1 of 38 partitions?
Here’s my code, and the matrix table description. Apologies if I’m missing something basic.
%time wes_genotypes.count()
CPU times: user 20 ms, sys: 10.6 ms, total: 30.6 ms Wall time: 34.2 s
(622351, 520)
intervals = [hl.parse_locus_interval('1:10027520-10027542')]
exon_vars = hl.filter_intervals(wes_genotypes, intervals, keep=True)
%time exon_vars.count()
CPU times: user 5.01 ms, sys: 0 ns, total: 5.01 ms Wall time: 12.8 s
(1, 520)
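To clarify what I mean by filtering against many short intervals, here is a plain-Python sketch of the semantics I’m after (no Hail dependency; the interval strings and variant loci are made up for illustration, and for simplicity I treat intervals as inclusive on both ends):

```python
def parse_interval(s):
    """Parse 'chrom:start-end' into (chrom, start, end)."""
    chrom, span = s.split(':')
    start, end = span.split('-')
    return chrom, int(start), int(end)

def overlaps(locus, intervals):
    """True if (chrom, pos) falls inside any of the intervals."""
    chrom, pos = locus
    return any(c == chrom and start <= pos <= end
               for c, start, end in intervals)

# One short interval, as in my test above; I eventually want thousands.
intervals = [parse_interval('1:10027520-10027542')]

# Hypothetical variant loci as (chrom, pos) pairs.
variants = [('1', 10027530), ('1', 9999999), ('2', 10027530)]

kept = [v for v in variants if overlaps(v, intervals)]
# kept == [('1', 10027530)]
```

In Hail I expect `hl.filter_intervals` to do this with an interval-tree lookup over partitions, which is why the 12 s surprises me.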
wes_genotypes.describe()
(omitting a large number of INFO row fields)
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
's': str
----------------------------------------
Row fields:
'locus': locus<GRCh37>
'alleles': array<str>
'rsid': str
'qual': float64
'filters': set<str>
'info': struct {
...
----------------------------------------
Entry fields:
'AD': array<int32>
'DP': int32
'GQ': int32
'GT': call
'MIN_DP': int32
'PL': array<int32>
'SB': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------