Performing sample missingness filtering on multiple pVCF files

barioux · November 9, 2023, 5:16pm

Hi,

I am using Hail on the UK Biobank RAP and work with the pVCF bulk files.

I want to remove participants with a genotype missingness of 10% or more as part of my QC pipeline. I will only use about 200 small genomic coordinate intervals after the QC and this is the only column-level QC I am using.

I wondered what the most efficient way to do this in Hail would be. I thought of concatenating all pVCF files with union_rows(), but this is quite computationally laborious.

Thanks for your help!

danking · November 9, 2023, 6:32pm

@barioux Have you tried filter_intervals?

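import hail as hl

mt = hl.import_vcf(...)
intervals = [hl.parse_locus_interval(x) for x in ['1:50M-75M', '2:START-400000', '3-22']]
# keep=True (the default) retains only rows inside the intervals; keep=False would remove them.
mt = hl.filter_intervals(mt, intervals, keep=True)
# Annotate each sample with a QC metric, then drop the samples that fail it.
mt = mt.annotate_cols(qc_metric = ...)
mt = mt.filter_cols(...)
hl.export_vcf(mt, ...)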

That should extract just the samples of interest at the intervals of interest in time proportional to the size of the intervals and the number of samples.

barioux · November 9, 2023, 8:22pm

@danking thanks for your quick reply. This function will be helpful at the stage of subsetting the locus intervals.

I guess I am more specifically looking at how to do the sample QC by call rate. I can do this on one pVCF in the UKB, but I am not sure how to compute the call rate across all chromosomes.

In the UKB, pVCFs are split into blocks, e.g. for chromosome 5 there are ukb23157_c5_b1_v1.vcf.gz, ukb23157_c5_b2_v1.vcf.gz, ukb23157_c5_b3_v1.vcf.gz etc.

Can I generate a concatenated pVCF with all blocks for all chromosomes, then run the sample QC on it?

Thanks again!
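
One observation that may help with the question above: per-sample missingness is a ratio of two counts (missing genotypes over total genotypes), and both counts are additive across blocks. So in principle the blocks do not have to be physically concatenated; the counts can be accumulated block by block and divided at the end. The following is only a minimal pure-Python sketch of that arithmetic, not Hail code. The sample IDs, genotypes, and function names are invented for illustration, and `None` stands in for a missing call.

```python
from collections import defaultdict

# Toy data: one dict per pVCF block, mapping sample ID -> list of genotype calls.
# None marks a missing call. All values here are made up for illustration.
block1 = {"S1": ["0/0"] * 6, "S2": ["0/0"] * 5 + [None], "S3": ["0/1"] * 6}
block2 = {"S1": ["0/1"] * 4, "S2": ["0/0"] * 4, "S3": ["0/0"] * 2 + [None] * 2}
blocks = [block1, block2]


def missing_fraction(blocks):
    """Accumulate missing and total genotype counts per sample across blocks."""
    missing = defaultdict(int)
    total = defaultdict(int)
    for block in blocks:
        for sample, genotypes in block.items():
            missing[sample] += sum(gt is None for gt in genotypes)
            total[sample] += len(genotypes)
    # Divide only once, after every block has been counted.
    return {sample: missing[sample] / total[sample] for sample in total}


def samples_to_keep(blocks, max_missing=0.10):
    """Keep samples whose overall missingness is strictly below max_missing."""
    fractions = missing_fraction(blocks)
    return sorted(s for s, frac in fractions.items() if frac < max_missing)


kept = samples_to_keep(blocks)
```

Note that the threshold is applied to the pooled counts, not per block: here S3 has no missing calls in the first block but still fails overall. On the Hail side, hl.import_vcf accepts a list of paths or a glob pattern, so it may be possible to read all blocks as a single MatrixTable and compute the same ratio per column without calling union_rows() explicitly.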
