Miss PL score on hom-ref genotype after run_combiner step

poyingfu · March 15, 2023, 4:33pm

Hi Hail team,

My team try to use hl.de_novo() to call de novo variants from our trio base cohort.
But we figure out that the number of call is relatively low than expected. (We expected 60~120 de novo variants per family) Then, we visualized all de novo call using IGV, and we can not find any true variants
And during visualization, we figure out that all the variants we got from hl.de_novo() are located in difficult regions, such as long poly-A and/or repeat regions.

So, we trace back to the joint-called VCF. We compare the joint-called VCF from GATK genotypegvcf and hail VDS new_combiner(), and we figure out that it might cause by PL values.

This is the joint-VCF we got from hl.vds.new_combiner():

And this is the same position of joint-genotyping VCF from GATK genotypegvcf

As we can see, joint-VCF from GATK got all PL scores from GVCF, but hail’s new_combiner() treat 0/0 samples as missing.
I think it might cause problem when we try to call de novo, because when father and mother are 0/0 or 0|0, the PL score are missing in hail joint-call VCF.
Then it will not output as de novo call due to the “kid/dad/mom_linear_pl” part.
Here is the hl.de_novo() code that consider PL values from both Father, Mother, and Proband:

    kid = tm.proband_entry
    dad = tm.father_entry
    mom = tm.mother_entry

    kid_linear_pl = 10 ** (-kid.PL / 10)
    kid_pp = hl.bind(lambda x: x / hl.sum(x), kid_linear_pl)

    dad_linear_pl = 10 ** (-dad.PL / 10)
    dad_pp = hl.bind(lambda x: x / hl.sum(x), dad_linear_pl)

    mom_linear_pl = 10 ** (-mom.PL / 10)
    mom_pp = hl.bind(lambda x: x / hl.sum(x), mom_linear_pl)

    kid_ad_ratio = kid.AD[1] / hl.sum(kid.AD)
    dp_ratio = kid.DP / (dad.DP + mom.DP)

So, I’m wondering the reason Hail treat PL in 0/0 (or be more specific, allele [“ref”,“<NON_REF>”] from GVCF) as missing?

Thanks!
Po-Ying

danking · March 20, 2023, 11:46pm

Hey @poyingfu !

Can you share the code you used to create that VCF starting from the hl.vds.read_vds?

poyingfu · March 21, 2023, 4:32pm

Sure! Thanks for helping!
Here is the code I ran, First I joint-called all my GVCF with Hail VDS new_combiner():

INPUT_GVCFs = snakemake.input.bgz_gvcfs
OUTPUT_VDS = snakemake.output.vds

combiner = hl.vds.new_combiner(
    output_path=OUTPUT_VDS,
    temp_path='hail_tmp',
    gvcf_paths=INPUT_GVCFs,
    # vds_paths=vdses,
    use_genome_default_intervals=True,
)

combiner.run()

After joint-calling, I tried to check the PL we have, in VDS, is missing or not. Here’s the way I checked:

## Read-in VDS
vds = hl.vds.read_vds('WGS_50PilotTrios.vds')
vds = hl.vds.split_multi(vds)

## Only keep F020
samples = ['WGS_020_Father_UDN738696','WGS_020_Mother_UDN359408','WGS_020_Proband_UDN272617']
vds = hl.vds.filter_samples(vds, samples, keep=True)

## VDS to Merged Sparse MT
mt = hl.vds.to_merged_sparse_mt(vds)
mt = mt.drop(mt.gvcf_info)
mt.filter_rows(mt.locus==hl.locus("chr1", 13684, reference_genome='GRCh38')).entries().show()

So, here is the missing value I want to re-assign, on first two rows at PL column.

I checked GVCF individually, using the code as below:

inputs=['CP-F0260/CP-F0260-003A/CP-F0260-003A_BP_germline.g.vcf.bgz', 'CP-F0260/CP-F0260-002U/CP-F0260-002U_BP_germline.g.vcf.bgz', 'CP-F0260/CP-F0260-001U/CP-F0260-001U_BP_germline.g.vcf.bgz']

intervals = ["chr1"]
gvcf_mt_list = hl.methods.import_gvcfs(inputs, 
                                       [hl.parse_locus_interval(x, reference_genome='GRCh38') for x in intervals])

### gvcf_mt_list: Proband, Mom, Dad
## Father
fa_mt = gvcf_mt_list[2]
fa_mt = fa_mt.drop(fa_mt.info)
fa_mt.filter_rows(fa_mt.locus==hl.locus("chr1", 1321858, reference_genome='GRCh38')).entries().show()

## Mother
mo_mt = gvcf_mt_list[1]
mo_mt = mo_mt.drop(mo_mt.info)
mo_mt.filter_rows(mo_mt.locus==hl.locus("chr1", 1321858, reference_genome='GRCh38')).entries().show()

## Proband
pb_mt = gvcf_mt_list[0]
pb_mt = pb_mt.drop(pb_mt.info)
pb_mt.filter_rows(pb_mt.locus==hl.locus("chr1", 1321858, reference_genome='GRCh38')).entries().show()

Because we also run GATK joint genotyping on same samples, we check the VCF in the same position, which have the PL score and are exactly the same with GVCF.

So, with this info, we try to re-assign the PL score to our ready-to-analysis VQSR VCF.
After we re-assign the PL back, then re-run hl.de_novo(), the result seems more reasonable, which we can find true calls in HIGH confidence category.

tpoterba · March 21, 2023, 5:35pm

The Hail VDS combiner explicitly drops homozygous reference PLs. This needs to be at the least configurable, even if we leave the current behavior as the default.

Monkol_Lek · March 24, 2023, 4:15pm

Hi Tim, thanks for getting back with your thoughts. Are there any immediate plans to implement this? As it would be good for our large trio studies (e.g. PCGC) to have this for de novo calling that happens downstream. If not, my colleagues at WashU can have a stab at implementing it, if you able to point where in the code and they can build a version.

tpoterba · March 26, 2023, 1:07am

Yes, this is absolutely an oversight to make this hardcoded (based on a particular decision the gnomAD team wanted to make) rather than configurable. It’ll be an easy change, should have a moment to do that this week. Feel free to poke at the end of the week if I’ve not posted an update!

tpoterba · March 28, 2023, 11:27am

OK, turns out this is already a configurable parameter, though I still think we should make a change to stop dropping PL by default (this is the only field handled in such an explicit and unexpected way).

You can use the argument gvcf_reference_entry_fields_to_keep in hl.vds.new_combiner to set the entry fields to keep for reference blocks, which is computed from the data by default (we take 100000 rows, find ones with END present, and look at the fields that aren’t missing, then drop GT/PGT/PL).

You should pass a list of fields here that includes each FORMAT field that is present for reference blocks, plus PL. You can look up the entry schema of the vds.reference_data you’ve already generated, subtract END, and add PL, as well.

poyingfu · March 29, 2023, 4:41am

Hi @tpoterba

Thanks for point us the direction. I successfully ran through the hail.vds.new_combiner, as below:

combiner = hl.vds.new_combiner(
    output_path='/storage1/fs1/jin810/Active/fup/UDN_CP_download_for_Nahyun/WGS_F018_withPL.vds',
    temp_path='/storage1/fs1/jin810/Active/fup/hail_tmp',
    gvcf_paths=gvcfs,
    use_genome_default_intervals=True,
    gvcf_reference_entry_fields_to_keep=['DP','GQ','MIN_DP','PL']
)

combiner.run()

But after I want to convert VDS to sparse MT, I meet some issue:

vds = hl.vds.read_vds('/storage1/fs1/jin810/Active/fup/UDN_CP_download_for_Nahyun/WGS_F018_withPL.vds')
sparse_mt = hl.vds.to_merged_sparse_mt(vds)

The error shows:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 3
      1 vds = hl.vds.read_vds('/storage1/fs1/jin810/Active/fup/UDN_CP_download_for_Nahyun/WGS_F018_withPL.vds')
----> 3 sparse_mt = hl.vds.to_merged_sparse_mt(vds)
      7 # sparse_mt.filter_rows(sparse_mt.locus==hl.locus("chr1", 11240432, reference_genome='GRCh38')).entries().show()

File <decorator-gen-1614>:2, in to_merged_sparse_mt(vds, ref_allele_function)

File /usr/local/lib/python3.8/dist-packages/hail/typecheck/check.py:577, in _make_dec.<locals>.wrapper(__original_func, *args, **kwargs)
    574 @decorator
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)

File /usr/local/lib/python3.8/dist-packages/hail/vds/methods.py:159, in to_merged_sparse_mt(vds, ref_allele_function)
    157         info("to_merged_sparse_mt: using locus sequence context to fill in reference alleles at monomorphic loci.")
    158     else:
--> 159         raise ValueError("to_merged_sparse_mt: in order to construct a ref allele for reference-only sites, "
    160                          "either pass a function to fill in reference alleles (e.g. ref_allele_function=lambda locus: hl.missing('str'))"
    161                          " or add a sequence file with 'hl.get_reference(RG_NAME).add_sequence(FASTA_PATH)'.")
    162 ht = ht.select(
    163     alleles=hl.coalesce(ht['alleles'], hl.array([ref_allele_function(ht)])),
    164     # handle cases where vmt is not keyed by alleles
    165     **{k: ht[k] for k in vds.variant_data.row_value if k != 'alleles'},
    166     _entries=merge_arrays(ht['_ref_entries'], ht['_var_entries'])
    167 )
    168 ht = ht._key_by_assert_sorted('locus', 'alleles')

ValueError: to_merged_sparse_mt: in order to construct a ref allele for reference-only sites, either pass a function to fill in reference alleles (e.g. ref_allele_function=lambda locus: hl.missing('str')) or add a sequence file with 'hl.get_reference(RG_NAME).add_sequence(FASTA_PATH)'.

So, I add ref_allele_function=lambda locus: hl.missing('str') to hl.vds.to_merged_sparse_mt and it seems works fine. But I wonder the reason we need to add this option after we config the hl.vds.new_combiner?

## with PL
vds = hl.vds.read_vds('CP-F0260_withPL.vds')
sparse_mt = hl.vds.to_merged_sparse_mt(vds, ref_allele_function=lambda locus: hl.missing('str'))

sparse_mt.filter_rows(sparse_mt.locus==hl.locus("chr1", 11240432, reference_genome='GRCh38')).entries().show()

Although, The result shows this config assign the homozygous reference PLs to [0], not the scores [0,120,1800] as GATK genotypegvcf result and gvcf.

Then, I try to test the de_novo() function with this configured joint-called VDS, and see if I can get the positive from it.
But when I try to split multi-allele, the PL score will turn to missing again.

So, is there a way we can pass homozygous reference PLs through the hl.vds.split_multi or hl.experimental.sparse_split_multi?

Or can you help me with how to re-assign PL scores from GVCF to VQSR VCF? Thanks!

hail-team · March 31, 2023, 12:10pm

I see the problem – Hail is discarding the NON_REF allele and associated PLs. I think the solution isn’t quite clear, so let me think about this over the weekend. Sorry this is causing so many problems for you!

poyingfu · March 31, 2023, 5:12pm

Really appreciate your help! Have a nice weekend!

poyingfu · April 1, 2023, 3:45am

Hi Hail team,

I found a way re-assign the PL scores back to VQSR VCF, and then, run hl.de_novo().
The result table shows that hl.de_novo() worked great, as I was able to identify all the true variants.

Really thanks for helping this!

Best,
Po-Ying

Drdreammaerd · April 26, 2023, 9:15pm

Dear Hail Team,

This is David, at the same team as Po-Ying, who recently left her position. I am reaching out to follow up this issue. We have found that while we can successfully use hl.denovo by re-assigning the PL from their related g.vcf file in whole exome data, we are encountering difficulties in the whole genome data. I was wondering if you could suggest any methods, such as regenerating the PL and GQ score (like GenotypeGVCFs), or an updated version of the combiner that we can try?

Thank you for your time and help.

Bests,

David

Drdreammaerd · April 27, 2023, 4:52pm

Dear Hail Team,

I have conducted further investigations on the issue at hand and used the joint VCF file by GenomicDBImport followed by GenotypeGVCF as a positive control to perform the hl.denovo test. I tested on both WES and WGS data.

The reason why we were able to successfully re-assign the PL value from each individual g.vcf file is seems because when we performed HaplotypeCaller for WES, we applied ‘BP_RESOLUTION.’ However, for WGS data, we typically do not use this option as it generates too much unwanted information, resulting in some g.vcf files not having information for specific variants in certain samples. As a result, we cannot re-assign the PL value for those variants in that sample.

However, using GenomicDBImport and GenotypeGVCF, we can get all the variants information for each joint sample. I assume that is because this process recalculate the PL and GQ score back to each variant (Please correct me if I am wrong).

For example, for the variant located at chr1 43323393, we could only retrieve specific information from the g.vcf files for the proband and mother, and re-assign it back to the joint matrix table using hail.vds.combiner. However, the joint file generated by GenomicDBImport and GenotypeGVCF could have all the information for each sample.

Gvcf file

Screen Shot 2023-04-27 at 11.32.17 AM880×157 20.9 KB
Hail-joint matrix table entries information

Screen Shot 2023-04-27 at 11.31.21 AM873×83 5.27 KB
GenomicDBImport entries information

Screen Shot 2023-04-27 at 11.31.08 AM880×88 5.38 KB

On the other hand, for the variant at chr1 63771304, we could obtain specific information for all samples, allowing us to re-assign all PL values and successfully identify it in hl.denovo.

Gvcf file

Screen Shot 2023-04-27 at 11.36.24 AM883×249 36.7 KB
Hail-joint matrix table entries information

Screen Shot 2023-04-27 at 11.34.02 AM873×83 5.76 KB
GenomicDBImport entries information

Screen Shot 2023-04-27 at 11.35.17 AM880×97 5.67 KB

I hope this information provides more clarity and insight into the issue at hand.

Thank you for your support.

Best regards,

David

Topic		Replies	Views
PL is not passed in haploid calls. After combining gVCF in LPL field is empty array [] Hail Query & hailctl	0	312	May 4, 2023
Help for finding rare variants for 100 patients Hail Query & hailctl	10	442	March 20, 2023
De novo calls on hemizygous X variants Science	20	1192	September 16, 2022
Possible vcf_combiner issue Hail Query & hailctl	19	1242	June 15, 2020
LPL fields in Dragen msVCF Science	0	367	August 8, 2023

Miss PL score on hom-ref genotype after run_combiner step

Related topics