Question about hl.de_novo()

Hi Hail team,

What input does this de novo calling function expect, and how many de novo variants should I expect it to report per trio? Thanks!

I saw that the GitHub page for Kaitlin Samocha’s de novo caller says:

The variant must pass all of the filters applied by the variant caller. To accept TruthSensitivityTranche variants, use the -q flag. (Removed in 3.93)

I followed the GATK best practices: I ran VQSR and filtered out the variants that did not PASS QC at that step. I then used Hail’s de_novo() to call de novo variants from the “PASSed VQSR VCF” file.
But I got relatively low numbers from this de novo caller: 810 de novo variants in 24 trios. Also, why was the -q option removed from the script?

Also, what is the best input for hl.de_novo()? Is it the VCF after joint calling or the VCF after VQSR?
And should I filter out the variants that did not PASS VQSR? For example, should I do dataset = dataset.filter_rows(hl.len(dataset.filters) == 0) before calling hl.de_novo()?

Thanks for your time.

Hey @poyingfu !

Unfortunately, the Hail team isn’t well equipped to answer these scientific questions.

Perhaps @ksamocha could comment? In general, I think you’ll have a better chance emailing Kaitlin directly to ask about how to QC your dataset before de novo calling.

Note though that the Hail implementation of de_novo is exactly the Samocha algorithm, tested thoroughly for input/output equivalence. Neither implementation filters out non-PASS variants explicitly, to my knowledge.

Hi @poyingfu!

You should definitely use variants that have PASS for VQSR. The TruthSensitivityTranche flag was added for an old dataset where a collaborator was having issues with VQSR. It allowed one to consider lower-quality variants. We removed it later because our variant calling pipelines got better and using it wasn’t best practice.

The original script does force lines to have ‘PASS’ in the FILTER column: see de_novo_finder_3.py in ksamocha/de_novo_scripts on GitHub (commit bde3e40cba46e02d5b45bc4d780de7761e02aee6).

So you shouldn’t have to filter beforehand.
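
For anyone who prefers to make the PASS requirement explicit on the Hail side before calling hl.de_novo(), here is a minimal sketch. It reuses the filter expression from the question. The function name de_novo_on_pass_variants and the priors table (assumed to be keyed by locus and alleles with an AF field, for example derived from gnomAD) are illustrative assumptions, not part of Hail or of the original script.

```python
def de_novo_on_pass_variants(mt, pedigree, priors):
    """Keep only PASS rows of `mt`, then run hl.de_novo on what remains.

    Hypothetical helper for illustration:
      mt       -- MatrixTable with the GATK entry fields (GT, AD, DP, GQ, PL)
      pedigree -- an hl.Pedigree describing the trios
      priors   -- Table keyed by (locus, alleles) with a float field `AF`
                  (assumed population allele frequency)
    """
    # Imported inside the function so that defining the helper does not
    # require Hail to be initialized yet.
    import hail as hl

    # As far as I know, import_vcf stores FILTER as a set of strings and
    # PASS becomes the empty set, so an empty set means "passed all filters".
    # Rows where `filters` is missing are dropped by filter_rows as well.
    mt = mt.filter_rows(hl.len(mt.filters) == 0)

    # Look up the population frequency prior for each remaining variant.
    return hl.de_novo(mt, pedigree, pop_frequency_prior=priors[mt.row_key].AF)
```

Here mt would typically come from hl.import_vcf(...) and pedigree from hl.Pedigree.read(...) on a .fam file. To my knowledge hl.de_novo() expects biallelic sites, so multi-allelic sites may need to be split first (for example with hl.split_multi_hts).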

As for how many de novo variants you should expect, it depends on (1) sample size and (2) whether the data are exomes or genomes, and at what coverage. Your 810 variants across 24 trios work out to about 34 per trio. That is a bit low if these are 24 genome trios (expected 70-100/trio), but very high if they are 24 exome trios (expected 1-2/trio).
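
As a quick sanity check, the per-trio arithmetic can be written out in a few lines of Python. This is only a sketch: the helper names are made up for illustration, and the ranges are the rough figures quoted above rather than hard thresholds.

```python
# Rough expected de novo counts per trio, taken from the figures quoted above.
EXPECTED_PER_TRIO = {"genome": (70, 100), "exome": (1, 2)}


def per_trio_rate(n_de_novo, n_trios):
    """Average number of de novo calls per trio."""
    return n_de_novo / n_trios


def compare_to_expected(rate, assay):
    """Return 'below', 'within', or 'above' relative to the rough range."""
    low, high = EXPECTED_PER_TRIO[assay]
    if rate < low:
        return "below"
    if rate > high:
        return "above"
    return "within"


rate = per_trio_rate(810, 24)
print(rate)                                 # 33.75
print(compare_to_expected(rate, "genome"))  # below
print(compare_to_expected(rate, "exome"))   # above
```

A rate that sits between the two ranges, as it does here, is a hint to check both the sequencing type and the filtering applied before and after calling.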