Sample QC metrics


I have a merged VCF with 2500 samples and 63063 sites. We obtained the Hail jar from gs://hail-common/hail-hail-is-master-all-spark2.0.2-$HASH.jar and are running it on Spark 2.0.2, Scala 2.11, and Java 8. I’m getting unexpected output when I run this command to compute sample QC metrics:

qc = vds.sample_qc()

Selected fields for one sample (value reported by sample_qc / actual value):

nNotCalled: 63054 / 5754
nHomRef: 0 / 54800
nHet: 9 / 1723

Is there something I’m missing? I’m not sure why it’s not registering the genotypes correctly.
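For concreteness, the “actual values” above can be reproduced with a direct tally over the VCF, along these lines (a rough pure-Python sanity check using only the GT field, not Hail code):

```python
import gzip

def tally_sample(vcf_path, sample_name):
    """Count nNotCalled / nHomRef / nHet for one sample straight from a VCF."""
    counts = {'nNotCalled': 0, 'nHomRef': 0, 'nHet': 0}
    opener = gzip.open if vcf_path.endswith('.gz') else open
    with opener(vcf_path, 'rt') as fh:
        col = None
        for line in fh:
            if line.startswith('#CHROM'):
                # header line gives the column index for this sample
                col = line.rstrip('\n').split('\t').index(sample_name)
                continue
            if line.startswith('#') or col is None:
                continue
            fields = line.rstrip('\n').split('\t')
            # GT is the first entry of the sample field; normalize phased calls
            gt = fields[col].split(':')[0].replace('|', '/')
            if gt in ('.', './.'):
                counts['nNotCalled'] += 1
            elif gt == '0/0':
                counts['nHomRef'] += 1
            else:
                alleles = gt.split('/')
                if len(alleles) == 2 and alleles[0] != alleles[1]:
                    counts['nHet'] += 1
    return counts
```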


It’s possible that your VCF has unexpected / invalid input. I’ll submit a pull request to make this easier later today, but here’s what to do for now:

vds = hc.import_vcf('your_file.vcf')

This should write out some metrics about the VCF input. Can you paste the output of the report command (which should write to standard out) here?

Thanks. Here’s the output from the above commands; the report command gives nothing:

{u'nVariants': 63063L, u'nSamples': 2500, u'nGenotypes': 157657500L}

I’ve analyzed the VCF with bcftools and haven’t seen any errors so far.

We’ve seen invalid VCFs coming out of the bcftools module that splits multiallelics, but it’s entirely possible that other tools produce invalid VCFs as well.

I’m looking into the report stuff now. Are you running in a terminal / window where you can see the JVM output as well as python output? If you’re running in a Jupyter notebook, you need to look at the terminal that the notebook server is running in.

Alternatively, you can look at the hail.log file produced by this run. Look for a line that says “while importing:”; the useful information should come after that.
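If it’s easier, a throwaway snippet can pull that section out of the log (this assumes the default hail.log name in the working directory; adjust the path as needed):

```python
def import_report_lines(log_path='hail.log'):
    """Return the log lines that follow the 'while importing:' marker."""
    with open(log_path) as fh:
        lines = fh.read().splitlines()
    for i, line in enumerate(lines):
        if 'while importing:' in line:
            # everything after the marker is the import report
            return lines[i + 1:]
    return []  # marker not found: import logged nothing
```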

EDIT: The ways we’ve seen VCFs become invalid are:

  • GQ values that don’t match the second-lowest PL
  • AD fields that don’t match the number of alleles in the variant
  • sum of AD field > DP
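In code, those three checks amount to something like this (an illustrative sketch with made-up names, not Hail’s actual importer):

```python
def genotype_errors(n_alleles, gq=None, dp=None, ad=None, pl=None):
    """Return the consistency violations for one genotype.

    n_alleles: total alleles at the site (ref + alts).
    gq/dp are ints; ad/pl are lists of ints; None means the field is missing.
    """
    errors = []
    # GQ must equal the second-lowest PL value
    if gq is not None and pl is not None and gq != sorted(pl)[1]:
        errors.append('GQ does not match second-lowest PL')
    # AD must have one entry per allele in the variant
    if ad is not None and len(ad) != n_alleles:
        errors.append('AD length does not match allele count')
    # the per-allele depths cannot exceed the total depth
    if ad is not None and dp is not None and sum(ad) > dp:
        errors.append('sum(AD) exceeds DP')
    return errors
```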

This is what I see when importing one of our sample VCFs:

In [1]: from hail import *; hc = HailContext()
Using Spark's default log4j profile: org/apache/spark/
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
hail: info: SparkUI:
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version devel-5ad6d2e

In [2]: vds = hc.import_vcf('src/test/resources/sample.vcf')
hail: info: No multiallelics detected.
hail: info: Coerced sorted dataset

In [3]: vds.count()
Out[3]: {u'nGenotypes': 34600L, u'nSamples': 100, u'nVariants': 346L}

In [4]:
hail: info: while importing:
    sample.vcf  import clean

Ok, I see where this could be coming from. We are running this as part of a test with Qubole (a Zeppelin-based notebook), so I’ll try to figure out how to get the logs.

In the meantime, here are some details on our FORMAT tags. We used agg to combine single-sample gVCFs (generated with the Illumina Isaac pipeline) into a multi-sample VCF, and since it’s not a GATK VCF, some of the typical tags are missing. Here’s an example:

GT:GQ:DP:DPF:AD:PF 0/0:60:22:.:.,.:. 0/0:100:20:.:.:.

Could any of these be the issue?

  • PL is missing
  • AD for hom ref calls is “.” in some cases but “.,.” in others
  • Is it expected that even for hom ref calls, AD for all alleles in the variant will be included? I need to check whether these are lost during aggregation with agg
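For context, this is roughly how I’d expect a lenient parser to read that FORMAT string, treating a bare “.” as a wholly missing field and “.,.” as a list whose entries are each missing (illustrative only, not how Hail necessarily does it):

```python
def parse_format_field(key, raw):
    """Interpret one FORMAT value, treating '.' as missing.

    Returns None for a wholly missing field; for list-typed fields like AD
    or PL, returns a list with None in place of each missing entry.
    """
    if raw == '.':
        return None
    if key in ('AD', 'PL'):  # comma-separated integer lists
        return [None if tok == '.' else int(tok) for tok in raw.split(',')]
    return int(raw) if raw.isdigit() else raw

# the first sample field from the example record above
keys = 'GT:GQ:DP:DPF:AD:PF'.split(':')
values = '0/0:60:22:.:.,.:.'.split(':')
parsed = {k: parse_format_field(k, v) for k, v in zip(keys, values)}
```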

Somewhat related: is there a way to load multiple VCFs without having to merge them into a multi-sample VCF first? The spec says that when I provide multiple VCFs to import_vcf, all of them are expected to come from the same set of samples, but we have one WGS VCF per sample and wanted to see if there is a way to avoid merging them beforehand.

I’ve also tested ADAM a bit: I could merge individual VCFs into a single RDD/DataFrame after converting them to Parquet, but subsequent operations seem slower compared to Hail.


It sounds like AD could be the problem, but it’s hard to know for sure without seeing the report output.

We don’t really have a way to load gVCFs or multiple single-sample VCFs at the moment, unfortunately. I think the right thing to do is load gVCFs so that you can preserve the reference information in the ref blocks, but this is a ways off in Hail.