Export VDS to VCF

Hello,
hail version: 0.2.100-2ea2615a797a

I have a very large VDS dataset and I would like to filter and export a set of ~1000 variants as a gVCF file. I can’t find a direct method to export VDS as a VCF. I tried doing vds.to_dense_mt() and then exporting the mt as a VCF but this seems inefficient. I also tried hl.export_vcf(), but that method only supports Table or MatrixTable. It seems like hl.export_vcf() used to support VDS but not anymore. My current code is below for context.

vds = hl.vds.read_vds(wgs_vds)
vds_filtered = hl.vds.filter_variants(vds, variants_table)
vds_filtered_split = hl.vds.split_multi(vds_filtered)
mt_filtered_split = hl.vds.to_dense_mt(vds_filtered_split)
mt_filtered_split = mt_filtered_split.drop('FT')
hl.export_vcf(mt_filtered_split, os.path.join(bucket, 'test.vcf'))

Thanks,
Andrew

Hey Andrew!

Can you share more information about the run? How long did it take, how many partitions were in your progress bar, and how many cores were you using?

One easy way to improve speed is to use "test.vcf.bgz" to compress the data before writing it.
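For example, reusing mt_filtered_split, bucket, and os from your snippet (just a sketch):

hl.export_vcf(mt_filtered_split, os.path.join(bucket, 'test.vcf.bgz'))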

EDIT:

One more thing: you mentioned wanting a gVCF file. hl.export_vcf produces project VCF files, not gVCF files. Are you trying to read 1000 variants for exactly one sample?

EDIT2:

I think you’ll find better performance using

vds_filtered = hl.filter_intervals(
    vds,
    [hl.parse_locus_interval('chr1:10000-20000')]
)

EDIT3:

If you don’t care about the reference blocks for the individual that you’re extracting, then you shouldn’t need to convert to a dense_mt. You can extract the .variant_data and export that. Reference genotypes should appear as missing data. If you want the reference quality information, then you will indeed need the reference data, but such reference genotypes do not usually appear in a project VCF.

Hi Dan,

The variants that I’m extracting are select variants throughout the genome, not necessarily confined to a single 1,000-base block. This is for a pharmacogenomic analysis, so reference data is needed for appropriate haplotype characterization. My goal is to export the results as a multi-sample gVCF if possible. Worst case, I’m willing to settle for just variant-level data.

The data used to be represented as a MatrixTable but it has since moved to a VDS for more efficient storage. Previously I was able to extract ~800 variants from the MatrixTable in just a few minutes for a dataset with nearly 100k WGS samples.

Using the .bgz compression, total compute time was 2 hrs 22 mins. This is with a VDS that currently has 10k samples; the dataset is expected to reach ~250k in the next few months. Even assuming a linear increase in compute time (which it won’t be), this would take ~2 days. Additionally, the output was variant data only: none of the positions showed hom-ref or no-call genotypes for any sample.

I feel like it might be converting the entire dataset to a MatrixTable, not just the variants that I selected. When I do vds_filtered_split.variant_data.show() I see only variants that I selected. However, if I do vds_filtered_split.reference_data.show(), I can see positions that are definitely not part of the variants table.

I’m also curious why there isn’t a direct option to export the VDS as a VCF. If you have ideas on making this more efficient, and on why the output contained variant positions only, I’m willing to try them.

My computing profile and partition data are below:
[screenshot: compute profile and partition data]

Thanks!

High level comment: my apologies for your troubles. The VDS became popular quickly and we’re delinquent on making the interface as easy to use as the MT interface. It should absolutely not be this slow to access tiny portions of data.


More comments follow, but let me get right at your issue.

Yes, the reference data is always the full reference data even if you’ve filtered the variant data (with filter_variants).

I think filter_variants is a bit of a footgun currently because we don’t take a certain fast path when you have a very small number of variants. The following is the absolute fastest way to subset to certain loci.

your_loci = hl.literal({hl.Locus('chr1', 12345), ....})
vd = vds.variant_data
vd = vd.filter_rows(your_loci.contains(vd.locus))
vds = hl.vds.VariantDataset(vds.reference_data, vd)

OK, my gut instinct for you would be that if all you need are GTs you should be able to do:

vd = vd.annotate_entries(GT = hl.coalesce(vd.GT, hl.call(0, 0)))
hl.export_vcf(vd, ...)

And that will produce a VCF file with just the variant data. It should run quickly, in time proportional to the number of loci. I’m surprised you didn’t get any hom-ref or no-call samples in your attempt.


It sounds like you might need or want the reference information as well. Do you want the reference information just at those loci? I think (I’m not really the VDS expert, that’s @tpoterba , but he’s out this week) you’ll get better performance from this:

your_loci = [hl.Locus('chr1', 12345), ....]
vds = hl.vds.filter_intervals(
    vds,
    intervals = [hl.interval(l, l, includes_end=True) for l in your_loci],
    split_reference_blocks=True
)
hl.export_vcf(hl.vds.to_dense_mt(vds), ...)

Again, this is a limitation of Hail’s optimizer: it doesn’t realize that, for small tables, it should use a different, more efficient filtering strategy.


What do you have in mind for a multi-sample GVCF? As far as I know, there isn’t an agreed upon standard for what that is.

We have something we internally call the “Sparse Variant Call Representation” or SVCR. That representation is realized in the Hail format as a Variant Dataset (VDS). You could realize that in multi-sample VCF format, but we haven’t built anything for that yet.

The reason we don’t have an hl.export_vcf for VDS is that producing a project VCF from a VDS is a super-linear operation. We could conceivably write an hl.export_svcr_vcf which writes the Sparse Variant Call Representation realized in VCF format. One of many things to do :/.


Ah, do not use “workers”; those are ~3x the cost for no benefit in your case. Hail is carefully designed to use preemptible workers in most cases (certainly in this case here).

Also, you really want to use an auto-scaling cluster so that you can take advantage of tons of compute massively in parallel for a short period of time. We describe how to do that for ordinary dataproc clusters, but it looks like you might be inside Terra? Are there any options for an auto-scaling cluster in there?

As an aside, a few years ago, I wrote some notes on using Hail efficiently in the cloud

Thank you, that was an incredibly helpful response. I read through the documentation you sent and changed my environment to use preemptible workers. This environment is through the All of Us Research Program, so I only have access to whatever the program currently has available. An auto-scaling cluster is not yet supported, but it might be in the works for the future.

I tried your directions for the variant data but couldn’t get them to work. It was incredibly fast though, similar to the speeds I was expecting. I did change hl.coalesce(vd.GT, hl.call(0, 0)) to hl.coalesce(vd.LGT, hl.call(0, 0)) since GT does not exist within the vd. However, all the reference calls still show as NA when using vd.GT.show(), and when I export the VCF all the reference calls are no-calls. I can’t figure out why. Are there other attributes that I need to set in order for the calls to be recognized properly? Example below:
[screenshot: vd.GT.show() output with reference calls displayed as NA]

As far as the structure of the gVCF goes: my goal is just to have a VCF where each line represents one of the variants that I am filtering for, along with the call for each sample. It might be a two-step process of filtering variants and reference data separately, followed by combining and sorting the VCF at a later step. How come the process is different for the reference data? Can I not just use the same procedure but replace variant_data with reference_data?

Hmm. I think we need to wait for Tim’s insight here. He’ll ask for the exact script you’re running so that he can be certain of what’s happening.

OK, I think we’re both trying to produce the same output. Clearly, I’m missing something above. Tim will probably see the error quite quickly.

The reference data is structured differently from the variant data because it is sparse. In particular, the reference data is stored as per-sample “ref blocks” which are genomic intervals with shared quality metadata. The variant data is simpler: it’s just a dense matrix of data.
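A small sketch of what those blocks look like, assuming the usual GQ and END reference entry fields (they do appear in the schema later in this thread):

# each reference_data entry applies from `locus` through `END` with one shared GQ,
# e.g. locus=chr1:10000, END=10049, GQ=60 covers 50 hom-ref positions for that sample
vds.reference_data.select_entries('END', 'GQ').show(5)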


OK, I have one more idea:

your_loci = hl.literal({hl.Locus('chr1', 12345), ....})
vd = vds.variant_data
vd = vd.filter_rows(your_loci.contains(vd.locus))
vd = vd.unfilter_entries()
vd = vd.annotate_entries(
    GT = hl.coalesce(hl.vds.lgt_to_gt(vd.LGT, vd.LA), hl.call(0, 0))
)
hl.export_vcf(vd, ...)

This is a subtle point about a missing entry vs a filtered entry. We treat these subtly differently: the former represents a measurement whose value we lack, whereas the latter represents something we’ve chosen to exclude from our experiment entirely. It affects how, for example, a mean is calculated. I think VDS might be using filtered entries rather than missing entries. This is really an implementation detail that is unfortunately visible to you.
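A minimal sketch of the distinction, using hl.balding_nichols_model purely as a stand-in toy dataset:

mt = hl.balding_nichols_model(1, 2, 2)
mt = mt.filter_entries(mt.GT.is_het())                # non-het entries become filtered
print(mt.aggregate_entries(hl.agg.count()))           # filtered entries are excluded from aggregation
mt = mt.unfilter_entries()                            # filtered entries become missing
print(mt.aggregate_entries(hl.agg.count_where(hl.is_missing(mt.GT))))  # missing entries are back in the matrix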

One concern with this approach: this assumes every non-variant genotype is homref even though it might be better treated as “no call”. Again, Tim will have a more sophisticated understanding to share on Monday.

This is AoU data, right? Can you post the schema of the VDS after reading?

vds.variant_data.describe()
vds.reference_data.describe()

We had an issue (which I think is only fixed in a branch, and definitely not in 0.2.100) where having both the GT and LGT fields in the original VDS didn’t play nicely with split_multi and/or to_dense_mt.

@danking Ah gotcha. I forgot that the reference data is structured differently. I would like to be able to differentiate between no-calls and reference calls since these are treated differently in our analysis. Otherwise my approach would be to exclude the reference data and just assume all samples are hom-ref for that variant.

@tpoterba Yes this is AoU data. The schema is below for each.
vds.variant_data.describe()

----------------------------------------
Global fields:
    'tranche_data': array<struct {
        model: str, 
        truth_sensitivity: float64, 
        min_vqslod: float64, 
        filter_name: str
    }>
    'truth_sensitivity_snp_threshold': float64
    'truth_sensitivity_indel_threshold': float64
    'snp_vqslod_threshold': float64
    'indel_vqslod_threshold': float64
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'filters': set<str>
    'as_vqsr': dict<str, struct {
        model: str, 
        vqslod: float64, 
        yng_status: str
    }>
----------------------------------------
Entry fields:
    'GQ': int32
    'RGQ': int32
    'LGT': call
    'LAD': array<int32>
    'LA': array<int32>
    'FT': bool
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

vds.reference_data.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'ref_allele': str
----------------------------------------
Entry fields:
    'GQ': int32
    'END': int32
----------------------------------------
Column key: ['s']
Row key: ['locus']
----------------------------------------

Following up on this. I am getting close but still running into some issues. My code is below.
hail version is now updated to 0.2.104-1940d9e8eaab

vds = hl.vds.read_vds(wgs_vds)
vd = vds.variant_data
vd = vd.filter_rows(locis.contains(vd.locus))
vd = vd.drop('FT')
vd = vd.unfilter_entries()
vd = vd.annotate_entries(
    GT = hl.coalesce(hl.vds.lgt_to_gt(vd.LGT, vd.LA), hl.call(0, 0))
)
vcf_bucket = os.path.join(bucket, "test.vcf")
hl.export_vcf(vd, vcf_bucket)

I am able to get all the reference calls in the variant data filled, but it seems like the GT is not being filled in correctly. As an example, I have a multiallelic variant shown below. The LGT for the first sample shows that the sample carries 2 copies of the A allele.

[screenshot: entries for the multiallelic variant showing LGT, LAD, and LA]

However, when I export the VCF, the GT field shows the following which is not the correct call for the sample. Technically, the LGT field is still correct, but it would be confusing to use LGT for variant calls and GT for reference calls:

#CHROM	POS	REF	ALT	FILTER	FORMAT	1234567
chr22	42126611	C	A,G,T	PASS	GT:GQ:RGQ:LGT:LAD:LA	2/2:99:208:1/1:0,41:0,2

I still haven’t been able to figure out how to export the reference data. I tried some of the suggestions above and I don’t think I’m close. The problem is that the above methods still leave the data as a VDS. It’s fine if the reference data is exported separately. I can always combine later. If you have other ideas for this let me know.

My other concern at the moment: in the current VDS model, is there a way to differentiate a reference call from a no-call? These are important to differentiate for downstream analysis so that no-calls are not simply assigned a reference call.

What’s the LA for these samples? LGT is a local genotype, so that 1 does not necessarily refer to the allele A; it refers to the first allele in the local alleles array, LA. To get the sequence corresponding to the first allele in a biallelic local genotype, you can do this:

mt.alleles[mt.LA[mt.LGT[0]]]

Change to mt.LGT[1] for the second allele of a biallelic local genotype.
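As a concrete check against the row you posted (alleles C, A, G, T; LA = [0, 2]; LGT = 1/1), a quick sketch with hl.eval:

gt = hl.eval(hl.vds.lgt_to_gt(hl.call(1, 1), hl.literal([0, 2])))
print(gt)  # Call(alleles=[2, 2], phased=False), i.e. 2/2 = G/G rather than A/A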


If you use hl.vds.filter_intervals, you get a new vds. To create a dense VCF, you have to use to_dense_mt first, that’s how we achieve the “dense” part of dense VCF.
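Roughly, as a sketch (assuming a vds and a list of intervals are already defined, and writing to a hypothetical path):

sub_vds = hl.vds.filter_intervals(vds, intervals, keep=True)
dense_mt = hl.vds.to_dense_mt(sub_vds)
hl.export_vcf(dense_mt, 'gs://my-bucket/subset.vcf.bgz')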


to_dense_mt does this for you. We strongly agree about the importance of reference information! Without reference information you cannot confidently distinguish between homref and no call. The VDS was designed to make that possible.

Tim and I had a chat and I think there’s one clearly missing piece: we need to be able to rapidly identify the set of reference blocks which overlap your interval or variants of interest. VDS has so far been used for genome-wide analyses (particularly the genome aggregation database project has been using it). If you want to pluck out just a handful of variants, we, unfortunately, have to scan across all the reference data (because, in principle, a reference block could span an entire chromosome). We need a fast “interval-interval” join which would let us find the set of reference blocks that overlap the set of intervals or variants of interest.

Tim and I are working out exactly how to build that, it’s a somewhat tricky CS problem, but that’s what we’ll need to serve your needs efficiently.

In the meantime, are the variants you want available in the dense matrix table provided by AoU?

I completely forgot about the LGT representation. I remember reading that in the documentation and then forgot all about it. Sorry about that. The GT is correct.

I tried using filter_intervals() for the reference data but I just kept getting stack overflow and memory errors, likely for the reasons that you mentioned. I received some code from Lee that he created with help from your team. He used it to filter regions/samples and create a dense MT, but when I replicate the code, I keep running out of memory within the nodes during my run. I tried doubling the memory within each node without increasing the number of cores, but still hit the same issue. I suspect it’s because I am searching for so many specific variants throughout the genome rather than just a few regions. I can hold off on my analysis for now; I don’t want to rush development of something. The code I got from Lee is paraphrased below:

import hail as hl
spark_conf_more_ram = dict()
spark_conf_more_ram["spark.executor.memory"] = "6g"
spark_conf_more_ram["spark.executor.cores"] = "4"
spark_conf_more_ram["spark.driver.memory"] = "128g"
hl.init(default_reference='GRCh38', idempotent=False, spark_conf=spark_conf_more_ram)

n_parts=70000 
vds1 = hl.vds.read_vds(vds_url)
intervals = vds1.reference_data._calculate_new_partitions(n_parts)
callset = hl.vds.read_vds(vds_url, intervals=intervals)

interval_table = hl.import_bed(bed_file, reference_genome='GRCh38')
callset_filtered = hl.vds.filter_intervals(callset, interval_table, keep=True)

vd_gt = callset_filtered.variant_data.transmute_entries(
    GT = hl.vds.lgt_to_gt(callset_filtered.variant_data.LGT, callset_filtered.variant_data.LA),
    FT=hl.if_else(callset_filtered.variant_data.FT, "PASS", "FAIL")
)

d_callset = hl.vds.to_dense_mt(hl.vds.VariantDataset(callset_filtered.reference_data, vd_gt))
d_callset = d_callset.annotate_rows(gt_stats = hl.agg.call_stats(d_callset.GT, d_callset.alleles))
d_callset = d_callset.rename({"gt_stats":"info"})

hl.export_vcf(d_callset, parent_bucket + "/d_callset"+fn_append+".vcf.bgz", tabix=False)

I wasn’t sure if your question was referring to the actual MatrixTable that we have been using or the MatrixTables within the VDS, so I’ll answer it both ways. As far as I know, AoU is no longer providing a MatrixTable; the data is now, and likely will continue to be, only available as a VDS. When we had just a MatrixTable, I was easily able to extract the variants I wanted, and the ones where at least 1 sample was not hom-ref were all present. In the current VDS, I have only been able to extract variant data for a few samples as a test. If I do the entire dataset, I always crash due to memory issues as described above.

I think I was able to get around this by skipping the lines below

intervals = vds1.reference_data._calculate_new_partitions(n_parts)
callset = hl.vds.read_vds(vds_url, intervals=intervals)

My guess is that breaking the intervals into Lee’s recommended 70,000 parts for the whole genome was just way too much for the small PGx regions. I was able to export a dense MT and a sites-only VCF from this VDS. Does that make sense to Tim and Dan as a reason for the trouble?
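If it helps to confirm that, the partition counts can be inspected directly (a sketch, using the names from the code above):

print(callset_filtered.reference_data.n_partitions())
print(callset_filtered.variant_data.n_partitions())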

I figured out a solution, thanks to Karryne demonstrating that it is possible. I should have noticed an important aspect of the reference data that @danking mentioned: to find the reference blocks that overlap a set of variants, Hail has to scan the reference data across the entire genome. The approach now is to first filter to all gene regions of interest and write out a MatrixTable, then use that MatrixTable to filter down to the selected variants. It seems counterintuitive, but it works because it avoids repeatedly iterating through the reference data, which is what I suspect I was getting hung up on. I should have realized this sooner. Working code is below.

Filter for gene regions

interval_table = hl.import_bed(pgx_genes_bed_bucket, reference_genome='GRCh38')
callset_filtered = hl.vds.filter_intervals(vds, interval_table, keep=True)
vd_gt = callset_filtered.variant_data.transmute_entries(
    GT = hl.vds.lgt_to_gt(callset_filtered.variant_data.LGT, callset_filtered.variant_data.LA),
    FT=hl.if_else(callset_filtered.variant_data.FT, "PASS", "FAIL")
)
d_callset = hl.vds.to_dense_mt(hl.vds.VariantDataset(callset_filtered.reference_data, vd_gt))
d_callset = d_callset.annotate_rows(gt_stats = hl.agg.call_stats(d_callset.GT, d_callset.alleles))
d_callset = d_callset.rename({"gt_stats":"info"})
output_url = os.path.join(bucket, "pgx_genes.mt")
d_callset.write(output_url, overwrite=True)

Filter for variants

def create_intervals(data):
    intervals = []
    for i, (chromosome, start) in data[["CHROM", "POS"]].iterrows():
        locus = f"{chromosome}:{start}"
        parsed_locus = hl.parse_locus(locus, reference_genome="GRCh38")
        intervals.append(parsed_locus)
    return intervals

def split_multi_allelic(mt):
    bi = mt.filter_rows(hl.len(mt.alleles) == 2)
    bi = bi.annotate_rows(a_index=1, was_split=False)
    multi = mt.filter_rows(hl.len(mt.alleles) > 2)
    split = hl.split_multi_hts(multi)
    mt = split.union_rows(bi)
    return mt

mt = hl.methods.read_matrix_table(output_url)
intervals = create_intervals(variant_positions)
mt_filtered = mt.filter_rows(hl.literal(intervals).contains(mt.locus))
mt_filtered_split = split_multi_allelic(mt_filtered)
vcf_bucket = os.path.join(bucket, "pgx_variants_250k.vcf")
hl.export_vcf(mt_filtered_split, vcf_bucket)

The one potential problem (which may not be a problem) I have left is that the variant call rate seems abnormally high for some of these variants. In the previous version of the dataset, the same variants had incredibly low call rates (0.1-0.3); now they have a call rate of 1.0. This seems unlikely considering the DRAGEN version has not changed. I do see some variants that are not at a call rate of 1, but they are still abnormally high. Counts are below. Any idea what might be the cause, or whether these seem correct?

call_rate	count
1.000000	512
0.999990	3
0.986780	3
0.999970	2
0.999980	1
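For reference, a call-rate tally like the one above could be produced with something along these lines (a sketch using hl.variant_qc; not necessarily how these counts were generated):

mt_qc = hl.variant_qc(mt_filtered_split)
print(mt_qc.aggregate_rows(hl.agg.counter(hl.format('%.6f', mt_qc.variant_qc.call_rate))))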