Performing sample missingness filtering on multiple pVCF files


I am using Hail on the UK Biobank RAP and work with the pVCF bulk files.

I want to remove participants with a genotype missingness of 10% or more as part of my QC pipeline. I will only use about 200 small genomic coordinate intervals after the QC and this is the only column-level QC I am using.

What is the most efficient way to do this in Hail? I thought of concatenating all the pVCF files with union_rows(), but this is computationally expensive.

Thanks for your help!

@barioux Have you tried filter_intervals?

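import hail as hl

mt = hl.import_vcf(...)
# Intervals of interest; '3-22' spans whole contigs 3 through 22.
intervals = [hl.parse_locus_interval(x) for x in ['1:50M-75M', '2:START-400000', '3-22']]
# keep=True (the default) retains only these intervals; keep=False would remove them.
mt = hl.filter_intervals(mt, intervals, keep=True)
# Compute a per-sample QC metric, then drop the samples that fail it.
mt = mt.annotate_cols(qc_metric = ...)
mt = mt.filter_cols(...)
hl.export_vcf(mt, ...)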

That should extract just the samples of interest at the intervals of interest in time proportional to the size of the intervals and the number of samples.

@danking thanks for your quick reply. This function will be helpful at the stage of subsetting the locus intervals.

More specifically, I am looking for how to do the sample QC by call rate. I can do this on a single pVCF in the UKB, but I am not sure how to compute the call rate across all chromosomes.

In the UKB, the pVCFs are split into blocks, e.g. for chromosome 5 there are ukb23157_c5_b1_v1.vcf.gz, ukb23157_c5_b2_v1.vcf.gz, ukb23157_c5_b3_v1.vcf.gz, etc.
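
Because the genotypes are spread over many block files, it may help to note that a sample's overall call rate is the pooled ratio (called genotypes summed over blocks, divided by variants summed over blocks), not the mean of the per-block call rates. A small plain-Python illustration, where the function name and all counts are made up for the example:

```python
def pooled_call_rate(block_counts):
    """Overall call rate for one sample.

    block_counts: iterable of (n_called, n_variants) pairs, one pair per block.
    """
    counts = list(block_counts)  # allow generators; we iterate twice below
    total_called = sum(n_called for n_called, _ in counts)
    total_variants = sum(n_variants for _, n_variants in counts)
    return total_called / total_variants


# Made-up counts for one sample in three blocks of very different sizes.
blocks = [(90, 100), (9_900, 10_000), (40, 50)]
overall = pooled_call_rate(blocks)
mean_of_rates = sum(c / n for c, n in blocks) / len(blocks)
print(f"pooled: {overall:.4f}, mean of per-block rates: {mean_of_rates:.4f}")
```

With these counts the two numbers fall on opposite sides of a 90% threshold, which is why per-block call rates should not simply be averaged.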

Can I generate a concatenated pVCF with all blocks for all chromosomes, then run the sample QC on it?

Thanks again!
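
One possible approach that avoids union_rows(), sketched under some assumptions: hl.import_vcf accepts a list of paths or a glob pattern, so all blocks can be read as a single MatrixTable and the per-sample call rate computed over every variant in one pass. Only the ukb23157_c*_b*_v1.vcf.gz naming comes from the file names above; the directory and output paths, the GRCh38 reference, the force_bgz setting, the example interval and both function names are placeholders chosen for illustration.

```python
def call_rate_threshold(max_missingness: float = 0.10) -> float:
    """Samples are kept only if their call rate is strictly above this value."""
    return 1.0 - max_missingness


def sample_qc_then_subset(vcf_glob, intervals, call_rate_ht_path, out_vcf_path,
                          max_missingness=0.10, reference_genome="GRCh38"):
    # Imported here so the threshold helper above can be used without Hail.
    import hail as hl

    # A glob (or a list of paths) lets Hail read every block as one MatrixTable.
    # force_bgz=True assumes the .vcf.gz files are block-gzipped.
    mt = hl.import_vcf(vcf_glob, force_bgz=True, reference_genome=reference_genome)

    # Per-sample call rate across all blocks: fraction of entries with a defined GT.
    cols = mt.annotate_cols(call_rate=hl.agg.fraction(hl.is_defined(mt.GT))).cols()
    # Save the per-sample table so the full scan of the pVCFs happens only once.
    cols = cols.checkpoint(call_rate_ht_path, overwrite=True)

    # Remove samples whose missingness is max_missingness or more.
    threshold = call_rate_threshold(max_missingness)
    mt = mt.filter_cols(cols[mt.col_key].call_rate > threshold)

    # Only now restrict to the regions of interest and export.
    parsed = [hl.parse_locus_interval(x, reference_genome=reference_genome)
              for x in intervals]
    mt = hl.filter_intervals(mt, parsed)
    hl.export_vcf(mt, out_vcf_path)


# Example call (every path and the interval are placeholders):
# sample_qc_then_subset(
#     "/path/to/pvcf/ukb23157_c*_b*_v1.vcf.gz",
#     ["chr5:1M-2M"],
#     "/path/to/output/call_rate.ht",
#     "/path/to/output/subset.vcf.bgz",
# )
```

The call rate is computed before filter_intervals so that it reflects all variants rather than only the retained regions; if the QC is meant to be restricted to the intervals, the two steps can be swapped, which would also make the scan much cheaper.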