Question about hl.de_novo()

poyingfu · February 28, 2023, 1:23am

Hi Hail team,

I have a question about the expected input for this de novo calling function. Also, how many de novo calls should I expect per trio? Thanks!

I saw that the GitHub page for Kaitlin Samocha’s de novo caller mentions:

The variant must pass all of the filters applied by the variant caller. To accept TruthSensitivityTranche variants, use the -q flag. (Removed in 3.93)

Following GATK best practices, I ran VQSR and filtered out the variants that did not PASS that step. Then I used Hail’s de_novo() to call de novo variants from the PASS-only VCF.
But I got a relatively low number of calls from this de novo caller: 810 de novo variants across 24 trios. Also, why was the -q option removed from the script?

Also, what is the best input for hl.de_novo(): the VCF straight after joint calling, or the VCF after VQSR?
And should I filter out the variants that did not PASS VQSR? For example, should I run dataset = dataset.filter_rows(hl.len(dataset.filters) == 0) before calling hl.de_novo()?

Thanks for your time.

danking · March 6, 2023, 4:40pm

Hey @poyingfu !

Unfortunately, the Hail team isn’t well equipped to answer these scientific questions.

Perhaps @ksamocha could comment? In general, I think you’ll have a better chance emailing Kaitlin directly to ask about how to QC your dataset before de novo calling.

tpoterba · March 6, 2023, 4:54pm

Note, though, that the Hail implementation of de_novo is exactly the Samocha algorithm and has been thoroughly tested for input/output equivalence. Neither filters out non-PASS variants explicitly, to my knowledge.
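
If you do want to drop non-PASS rows yourself before calling hl.de_novo(), here is a rough sketch built around the filter from the question. It is not an official recipe. The function name and the three path arguments are made up for illustration, and the priors file is assumed to be a TSV with Variant and AF columns.

```python
def de_novo_on_pass_variants(vcf_path, ped_path, priors_path):
    """Run hl.de_novo() on PASS-only rows.

    All three paths are hypothetical placeholders supplied by the caller.
    """
    # Imported inside the function so this file still loads where Hail
    # is not installed; calling the function does require Hail.
    import hail as hl

    mt = hl.import_vcf(vcf_path)

    # As I understand it, Hail represents a PASS row as an empty `filters`
    # set. Rows with a missing FILTER ('.') evaluate to missing here and
    # are dropped by filter_rows as well.
    mt = mt.filter_rows(hl.len(mt.filters) == 0)

    # hl.de_novo() expects biallelic variants.
    mt = hl.split_multi_hts(mt)

    pedigree = hl.Pedigree.read(ped_path)

    # Population allele-frequency priors. ASSUMPTION: a TSV with columns
    # `Variant` (chr:pos:ref:alt) and `AF`, on the same reference genome
    # as the VCF.
    priors = hl.import_table(priors_path, impute=True)
    priors = priors.transmute(**hl.parse_variant(priors.Variant))
    priors = priors.key_by('locus', 'alleles')

    # Returns a Hail Table with one row per candidate de novo call.
    return hl.de_novo(mt, pedigree, pop_frequency_prior=priors[mt.row_key].AF)
```

Whether this filter changes your call counts depends on whether the VCF still contains non-PASS rows when it reaches Hail.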

ksamocha · March 6, 2023, 8:19pm

Hi @poyingfu!

You should definitely use variants that have PASS for VQSR. That TruthSensitivityTranche flag was added for an old dataset where a collaborator was having issues with VQSR. It allowed one to consider lower-quality variants. We removed it later since our variant calling pipelines got better and it wasn’t best practice.

The original script does force lines to have ‘PASS’ in the FILTER column: de_novo_scripts/de_novo_finder_3.py at bde3e40cba46e02d5b45bc4d780de7761e02aee6 · ksamocha/de_novo_scripts · GitHub

So you shouldn’t have to filter beforehand.

As for how many de novo variants you expect, it depends on (1) sample size and (2) exome/genome coverage. Your results are a bit low if you had 24 genome trios (expected 70-100/trio), but very high if these were 24 exome trios (expected 1-2/trio).
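
As a quick sanity check, the 810 calls across 24 trios from the question can be put on a per-trio scale and compared with the ranges quoted in this reply. The function and constant names below are made up for illustration.

```python
# Expected de novo calls per trio, as quoted in this thread.
EXPECTED_PER_TRIO = {"genome": (70, 100), "exome": (1, 2)}


def de_novos_per_trio(total_calls, n_trios):
    """Average number of de novo calls per trio."""
    if n_trios <= 0:
        raise ValueError("n_trios must be positive")
    return total_calls / n_trios


def compare_to_expected(rate, assay):
    """Return 'below', 'within', or 'above' relative to the quoted range."""
    low, high = EXPECTED_PER_TRIO[assay]
    if rate < low:
        return "below"
    if rate > high:
        return "above"
    return "within"


rate = de_novos_per_trio(810, 24)
print(f"{rate:.2f} calls per trio")            # 33.75 calls per trio
print(compare_to_expected(rate, "genome"))     # below
print(compare_to_expected(rate, "exome"))      # above
```

So about 34 calls per trio falls between the two regimes, which is why it matters whether these are genome or exome trios.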
