Miss PL score on hom-ref genotype after run_combiner step

Hi Hail team,

My team try to use hl.de_novo() to call de novo variants from our trio base cohort.
But we figure out that the number of call is relatively low than expected. (We expected 60~120 de novo variants per family) Then, we visualized all de novo call using IGV, and we can not find any true variants
And during visualization, we figure out that all the variants we got from hl.de_novo() are located in difficult regions, such as long poly-A and/or repeat regions.

So, we trace back to the joint-called VCF. We compare the joint-called VCF from GATK genotypegvcf and hail VDS new_combiner(), and we figure out that it might cause by PL values.

This is the joint-VCF we got from hl.vds.new_combiner():

And this is the same position of joint-genotyping VCF from GATK genotypegvcf

As we can see, joint-VCF from GATK got all PL scores from GVCF, but hail’s new_combiner() treat 0/0 samples as missing.
I think it might cause problem when we try to call de novo, because when father and mother are 0/0 or 0|0, the PL score are missing in hail joint-call VCF.
Then it will not output as de novo call due to the “kid/dad/mom_linear_pl” part.
Here is the hl.de_novo() code that consider PL values from both Father, Mother, and Proband:

    kid = tm.proband_entry
    dad = tm.father_entry
    mom = tm.mother_entry

    kid_linear_pl = 10 ** (-kid.PL / 10)
    kid_pp = hl.bind(lambda x: x / hl.sum(x), kid_linear_pl)

    dad_linear_pl = 10 ** (-dad.PL / 10)
    dad_pp = hl.bind(lambda x: x / hl.sum(x), dad_linear_pl)

    mom_linear_pl = 10 ** (-mom.PL / 10)
    mom_pp = hl.bind(lambda x: x / hl.sum(x), mom_linear_pl)

    kid_ad_ratio = kid.AD[1] / hl.sum(kid.AD)
    dp_ratio = kid.DP / (dad.DP + mom.DP)

So, I’m wondering the reason Hail treat PL in 0/0 (or be more specific, allele [“ref”,“<NON_REF>”] from GVCF) as missing?

Thanks!
Po-Ying

Hey @poyingfu !

Can you share the code you used to create that VCF starting from the hl.vds.read_vds?

Sure! Thanks for helping!
Here is the code I ran, First I joint-called all my GVCF with Hail VDS new_combiner():

INPUT_GVCFs = snakemake.input.bgz_gvcfs
OUTPUT_VDS = snakemake.output.vds

combiner = hl.vds.new_combiner(
    output_path=OUTPUT_VDS,
    temp_path='hail_tmp',
    gvcf_paths=INPUT_GVCFs,
    # vds_paths=vdses,
    use_genome_default_intervals=True,
)

combiner.run()

After joint-calling, I tried to check the PL we have, in VDS, is missing or not. Here’s the way I checked:

## Read-in VDS
vds = hl.vds.read_vds('WGS_50PilotTrios.vds')
vds = hl.vds.split_multi(vds)

## Only keep F020
samples = ['WGS_020_Father_UDN738696','WGS_020_Mother_UDN359408','WGS_020_Proband_UDN272617']
vds = hl.vds.filter_samples(vds, samples, keep=True)

## VDS to Merged Sparse MT
mt = hl.vds.to_merged_sparse_mt(vds)
mt = mt.drop(mt.gvcf_info)
mt.filter_rows(mt.locus==hl.locus("chr1", 13684, reference_genome='GRCh38')).entries().show()

So, here is the missing value I want to re-assign, on first two rows at PL column.

I checked GVCF individually, using the code as below:

inputs=['CP-F0260/CP-F0260-003A/CP-F0260-003A_BP_germline.g.vcf.bgz', 'CP-F0260/CP-F0260-002U/CP-F0260-002U_BP_germline.g.vcf.bgz', 'CP-F0260/CP-F0260-001U/CP-F0260-001U_BP_germline.g.vcf.bgz']

intervals = ["chr1"]
gvcf_mt_list = hl.methods.import_gvcfs(inputs, 
                                       [hl.parse_locus_interval(x, reference_genome='GRCh38') for x in intervals])

### gvcf_mt_list: Proband, Mom, Dad
## Father
fa_mt = gvcf_mt_list[2]
fa_mt = fa_mt.drop(fa_mt.info)
fa_mt.filter_rows(fa_mt.locus==hl.locus("chr1", 1321858, reference_genome='GRCh38')).entries().show()

## Mother
mo_mt = gvcf_mt_list[1]
mo_mt = mo_mt.drop(mo_mt.info)
mo_mt.filter_rows(mo_mt.locus==hl.locus("chr1", 1321858, reference_genome='GRCh38')).entries().show()

## Proband
pb_mt = gvcf_mt_list[0]
pb_mt = pb_mt.drop(pb_mt.info)
pb_mt.filter_rows(pb_mt.locus==hl.locus("chr1", 1321858, reference_genome='GRCh38')).entries().show()

Because we also run GATK joint genotyping on same samples, we check the VCF in the same position, which have the PL score and are exactly the same with GVCF.

So, with this info, we try to re-assign the PL score to our ready-to-analysis VQSR VCF.
After we re-assign the PL back, then re-run hl.de_novo(), the result seems more reasonable, which we can find true calls in HIGH confidence category.

The Hail VDS combiner explicitly drops homozygous reference PLs. This needs to be at the least configurable, even if we leave the current behavior as the default.

Hi Tim, thanks for getting back with your thoughts. Are there any immediate plans to implement this? As it would be good for our large trio studies (e.g. PCGC) to have this for de novo calling that happens downstream. If not, my colleagues at WashU can have a stab at implementing it, if you able to point where in the code and they can build a version.

1 Like

Yes, this is absolutely an oversight to make this hardcoded (based on a particular decision the gnomAD team wanted to make) rather than configurable. It’ll be an easy change, should have a moment to do that this week. Feel free to poke at the end of the week if I’ve not posted an update!

OK, turns out this is already a configurable parameter, though I still think we should make a change to stop dropping PL by default (this is the only field handled in such an explicit and unexpected way).

You can use the argument gvcf_reference_entry_fields_to_keep in hl.vds.new_combiner to set the entry fields to keep for reference blocks, which is computed from the data by default (we take 100000 rows, find ones with END present, and look at the fields that aren’t missing, then drop GT/PGT/PL).

You should pass a list of fields here that includes each FORMAT field that is present for reference blocks, plus PL. You can look up the entry schema of the vds.reference_data you’ve already generated, subtract END, and add PL, as well.

1 Like

Hi @tpoterba

Thanks for point us the direction. I successfully ran through the hail.vds.new_combiner, as below:

combiner = hl.vds.new_combiner(
    output_path='/storage1/fs1/jin810/Active/fup/UDN_CP_download_for_Nahyun/WGS_F018_withPL.vds',
    temp_path='/storage1/fs1/jin810/Active/fup/hail_tmp',
    gvcf_paths=gvcfs,
    use_genome_default_intervals=True,
    gvcf_reference_entry_fields_to_keep=['DP','GQ','MIN_DP','PL']
)

combiner.run()

But after I want to convert VDS to sparse MT, I meet some issue:

vds = hl.vds.read_vds('/storage1/fs1/jin810/Active/fup/UDN_CP_download_for_Nahyun/WGS_F018_withPL.vds')
sparse_mt = hl.vds.to_merged_sparse_mt(vds)

The error shows:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 3
      1 vds = hl.vds.read_vds('/storage1/fs1/jin810/Active/fup/UDN_CP_download_for_Nahyun/WGS_F018_withPL.vds')
----> 3 sparse_mt = hl.vds.to_merged_sparse_mt(vds)
      7 # sparse_mt.filter_rows(sparse_mt.locus==hl.locus("chr1", 11240432, reference_genome='GRCh38')).entries().show()

File <decorator-gen-1614>:2, in to_merged_sparse_mt(vds, ref_allele_function)

File /usr/local/lib/python3.8/dist-packages/hail/typecheck/check.py:577, in _make_dec.<locals>.wrapper(__original_func, *args, **kwargs)
    574 @decorator
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)

File /usr/local/lib/python3.8/dist-packages/hail/vds/methods.py:159, in to_merged_sparse_mt(vds, ref_allele_function)
    157         info("to_merged_sparse_mt: using locus sequence context to fill in reference alleles at monomorphic loci.")
    158     else:
--> 159         raise ValueError("to_merged_sparse_mt: in order to construct a ref allele for reference-only sites, "
    160                          "either pass a function to fill in reference alleles (e.g. ref_allele_function=lambda locus: hl.missing('str'))"
    161                          " or add a sequence file with 'hl.get_reference(RG_NAME).add_sequence(FASTA_PATH)'.")
    162 ht = ht.select(
    163     alleles=hl.coalesce(ht['alleles'], hl.array([ref_allele_function(ht)])),
    164     # handle cases where vmt is not keyed by alleles
    165     **{k: ht[k] for k in vds.variant_data.row_value if k != 'alleles'},
    166     _entries=merge_arrays(ht['_ref_entries'], ht['_var_entries'])
    167 )
    168 ht = ht._key_by_assert_sorted('locus', 'alleles')

ValueError: to_merged_sparse_mt: in order to construct a ref allele for reference-only sites, either pass a function to fill in reference alleles (e.g. ref_allele_function=lambda locus: hl.missing('str')) or add a sequence file with 'hl.get_reference(RG_NAME).add_sequence(FASTA_PATH)'.

So, I add ref_allele_function=lambda locus: hl.missing('str') to hl.vds.to_merged_sparse_mt and it seems works fine. But I wonder the reason we need to add this option after we config the hl.vds.new_combiner?

## with PL
vds = hl.vds.read_vds('CP-F0260_withPL.vds')
sparse_mt = hl.vds.to_merged_sparse_mt(vds, ref_allele_function=lambda locus: hl.missing('str'))

sparse_mt.filter_rows(sparse_mt.locus==hl.locus("chr1", 11240432, reference_genome='GRCh38')).entries().show()

Although, The result shows this config assign the homozygous reference PLs to [0], not the scores [0,120,1800] as GATK genotypegvcf result and gvcf.

Then, I try to test the de_novo() function with this configured joint-called VDS, and see if I can get the positive from it.
But when I try to split multi-allele, the PL score will turn to missing again.

So, is there a way we can pass homozygous reference PLs through the hl.vds.split_multi or hl.experimental.sparse_split_multi?

Or can you help me with how to re-assign PL scores from GVCF to VQSR VCF? Thanks!

I see the problem – Hail is discarding the NON_REF allele and associated PLs. I think the solution isn’t quite clear, so let me think about this over the weekend. Sorry this is causing so many problems for you!

Really appreciate your help! Have a nice weekend!

Hi Hail team,

I found a way re-assign the PL scores back to VQSR VCF, and then, run hl.de_novo().
The result table shows that hl.de_novo() worked great, as I was able to identify all the true variants.

Really thanks for helping this!

Best,
Po-Ying

1 Like

Dear Hail Team,

This is David, at the same team as Po-Ying, who recently left her position. I am reaching out to follow up this issue. We have found that while we can successfully use hl.denovo by re-assigning the PL from their related g.vcf file in whole exome data, we are encountering difficulties in the whole genome data. I was wondering if you could suggest any methods, such as regenerating the PL and GQ score (like GenotypeGVCFs), or an updated version of the combiner that we can try?

Thank you for your time and help.

Bests,

David

Dear Hail Team,

I have conducted further investigations on the issue at hand and used the joint VCF file by GenomicDBImport followed by GenotypeGVCF as a positive control to perform the hl.denovo test. I tested on both WES and WGS data.

The reason why we were able to successfully re-assign the PL value from each individual g.vcf file is seems because when we performed HaplotypeCaller for WES, we applied ‘BP_RESOLUTION.’ However, for WGS data, we typically do not use this option as it generates too much unwanted information, resulting in some g.vcf files not having information for specific variants in certain samples. As a result, we cannot re-assign the PL value for those variants in that sample.

However, using GenomicDBImport and GenotypeGVCF, we can get all the variants information for each joint sample. I assume that is because this process recalculate the PL and GQ score back to each variant (Please correct me if I am wrong).

For example, for the variant located at chr1 43323393, we could only retrieve specific information from the g.vcf files for the proband and mother, and re-assign it back to the joint matrix table using hail.vds.combiner. However, the joint file generated by GenomicDBImport and GenotypeGVCF could have all the information for each sample.

On the other hand, for the variant at chr1 63771304, we could obtain specific information for all samples, allowing us to re-assign all PL values and successfully identify it in hl.denovo.

I hope this information provides more clarity and insight into the issue at hand.

Thank you for your support.

Best regards,

David