High level comment: my apologies for your troubles. The VDS became popular quickly and we’re delinquent on making the interface as easy to use as the MT interface. It should absolutely not be this slow to access tiny portions of data.
More comments follow, but let me get right at your issue.
Yes, the reference data is always the full reference data even if you’ve filtered the variant data (with `filter_variants`).
I think `filter_variants` is a bit of a footgun currently because we don’t take a certain fast path when you have a very small number of variants. The following is the absolute fastest way to subset to certain loci:
```python
your_loci = hl.literal({hl.Locus('chr1', 12345, reference_genome='GRCh38'), ....})
vd = vds.variant_data
vd = vd.filter_rows(your_loci.contains(vd.locus))
# the reference data must come from the original VDS, not the filtered variant MatrixTable
vds = hl.vds.VariantDataset(vds.reference_data, vd)
```
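(If your loci start out as strings, here’s a minimal sketch of building that literal; the string format, `Locus.parse` usage, and the GRCh38 reference genome are my assumptions, not from your question.)

```python
# a sketch, assuming GRCh38 and string-formatted loci like 'chr1:12345'
locus_strings = ['chr1:12345', 'chr2:67890']  # hypothetical input
your_loci = hl.literal({hl.Locus.parse(s, reference_genome='GRCh38') for s in locus_strings})
```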
OK, my gut instinct would be that, if all you need are GTs, you should be able to do:
```python
# treat entries that are missing from the variant data as homozygous-reference
vd = vd.annotate_entries(GT=hl.coalesce(vd.GT, hl.call(0, 0)))
hl.export_vcf(vd, ...)
```
That will produce a VCF file with just the variant data, and it should run quickly, in time proportional to the number of loci. I’m surprised you didn’t get any homref or no-call samples in your attempt.
It sounds like you might need or want the reference information as well. Do you want the reference information just at those loci? I think (I’m not really the VDS expert, that’s @tpoterba, but he’s out this week) you’ll get better performance from this:
```python
your_loci = [hl.Locus('chr1', 12345, reference_genome='GRCh38'), ....]
vds = hl.vds.filter_intervals(
    vds,
    # point intervals need includes_end=True because intervals are end-exclusive by default
    intervals=[hl.interval(l, l, includes_end=True) for l in your_loci],
    split_reference_blocks=True,
)
```
```python
hl.export_vcf(vds.to_dense_mt(), ...)
```
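(One caveat I’ll hedge on, since this varies by version: `to_dense_mt` keeps the VDS’s local-allele entry fields, so you may need to convert to a global `GT` before exporting. A sketch, assuming the standard GVCF-import entry schema with `LGT`/`LA` fields:)

```python
# convert local genotype (LGT) + local alleles (LA) into a global GT, then export
mt = vds.to_dense_mt()
mt = mt.annotate_entries(GT=hl.vds.lgt_to_gt(mt.LGT, mt.LA))
hl.export_vcf(mt.drop('LGT', 'LA'), 'subset.vcf.bgz')  # hypothetical output path
```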
Again, this comes down to a limitation of Hail’s optimizer: it should realize that, for small sets of loci, a different, more efficient filtering strategy is appropriate.
What do you have in mind for a multi-sample GVCF? As far as I know, there isn’t an agreed-upon standard for what that is.
We have something we internally call the “Sparse Variant Call Representation,” or SVCR. That representation is realized in Hail’s native format as the Variant Dataset (VDS). You could realize it in a multi-sample VCF format, but we haven’t built anything for that yet.
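(To make that concrete, a VDS is just two MatrixTables bundled together; here’s a quick sketch, with a hypothetical path, showing the two components:)

```python
import hail as hl

vds = hl.vds.read_vds('gs://my-bucket/my.vds')  # hypothetical path
# sparse reference blocks: rows keyed by locus, entries carry block END/quality fields
vds.reference_data.describe()
# variant records: rows keyed by locus and alleles, entries carry local-allele fields
vds.variant_data.describe()
```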
The reason we don’t have an `hl.export_vcf` for VDS is that producing a project VCF from a VDS is a super-linear operation: densifying fills in an entry for every sample at every variant row, and the number of variant rows itself grows with the number of samples. We could conceivably write an `hl.export_svcr_vcf` which uses the sparse variant call representation realized in VCF. One of many things to do :/.
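(To see why it’s super-linear, here’s a back-of-envelope sketch; the per-sample variant count and the linear-growth simplification are illustrative assumptions, not measured numbers:)

```python
# illustrative arithmetic: dense pVCF entries grow faster than linearly in samples
new_variants_per_sample = 40_000  # assumption, for illustration only
for n_samples in (1_000, 10_000, 100_000):
    n_variant_rows = n_samples * new_variants_per_sample  # rows grow roughly with samples
    dense_entries = n_samples * n_variant_rows            # pVCF stores every sample at every row
    print(f'{n_samples:>7,} samples -> ~{dense_entries:.1e} dense entries')
```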
Ah, do not use “workers”; those are ~3x the cost for no benefit in your case. Hail is carefully designed to run on preemptible workers in most cases (certainly in this one).
Also, you really want to use an auto-scaling cluster so that you can take advantage of tons of compute massively in parallel for a short period of time. We describe how to do that for ordinary Dataproc clusters, but it looks like you might be inside Terra? Are there any options for an auto-scaling cluster in there?
As an aside: a few years ago I wrote some notes on using Hail efficiently in the cloud.