Support for Copy Number (CN)


Presently, Hail supports the VCF FORMAT fields GT:AD:DP:GQ:PL. It would be nice if it could also support copy number (CN), which is part of the VCF spec. GenomeSTRiP, for example, generates CN in place of GT for copy number variants. CN starts at 0 (a homozygous deletion) and ranges up to the number of copies called.

##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events">
##FORMAT=<ID=CNQ,Number=1,Type=Float,Description="Copy number genotype quality for imprecise events">
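
For what it's worth, these header lines are easy to pick apart mechanically. Here is a minimal sketch in plain Python (this is not Hail's API, and `parse_format_header` is just a name I made up), assuming the straight double quotes the VCF spec uses around descriptions:

```python
import re

def parse_format_header(line):
    """Parse a '##FORMAT=<...>' meta line into a dict of its key/value pairs."""
    match = re.fullmatch(r'##FORMAT=<(.*)>', line.strip())
    if match is None:
        raise ValueError(f"not a ##FORMAT line: {line!r}")
    # A value is either a double-quoted string (which may contain commas)
    # or a bare token that runs up to the next comma.
    pairs = re.findall(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)', match.group(1))
    return {key: value.strip('"') for key, value in pairs}

cn = parse_format_header(
    '##FORMAT=<ID=CN,Number=1,Type=Integer,'
    'Description="Copy number genotype for imprecise events">'
)
print(cn["ID"], cn["Number"], cn["Type"])
```

The `Type` entry (Integer for CN, Float for CNQ) is what a reader would need in order to cast the per-sample values correctly.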

Here is a paper on the CNVs generated by GenomeSTRiP in 1000 Genomes.



Hi John! We actually spoke with Bob Handsaker about this very question on Monday. While we don’t have any specific plans to build CNV analysis models anytime soon, we are planning to generalize the VCF FORMAT fields that Hail reads. This would allow us to import CNV VCFs and operate on them with any of the general infrastructure we have now.
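
To give a rough idea of what "generalized" could mean here, below is a small sketch in plain Python. It is not Hail's implementation or API: `parse_sample`, `CASTS`, and the sample value `0:99.0` are all made up for illustration, and it only handles fields declared with Number=1.

```python
# Map VCF header types to Python casts (only the types used in this sketch).
CASTS = {"Integer": int, "Float": float, "String": str}

def parse_sample(format_column, sample_column, field_types):
    """Decode one sample column using the keys listed in the FORMAT column.

    field_types maps a FORMAT ID to its declared header Type, e.g. {"CN": "Integer"}.
    Missing values ('.') and dropped trailing fields come back as None.
    """
    keys = format_column.split(":")
    values = sample_column.split(":")
    parsed = {}
    for i, key in enumerate(keys):
        raw = values[i] if i < len(values) else "."
        cast = CASTS[field_types.get(key, "String")]
        parsed[key] = None if raw == "." else cast(raw)
    return parsed

field_types = {"CN": "Integer", "CNQ": "Float"}
call = parse_sample("CN:CNQ", "0:99.0", field_types)
print(call)
```

The point is that nothing in the decoding step needs to know about GT specifically; the FORMAT keys and the header types drive everything.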