I am using Hail on the UK Biobank RAP and work with the pVCF bulk files.
I want to remove participants with a genotype missingness of 10% or more as part of my QC pipeline. I will only use about 200 small genomic coordinate intervals after the QC and this is the only column-level QC I am using.
I wondered what the most efficient way to do this in Hail would be. I thought of concatenating all pVCF files with union_rows(), but this is quite computationally expensive.
Thanks for your help!
@barioux Have you tried the following?
mt = hl.import_vcf(...)
intervals = [hl.parse_locus_interval(x) for x in ['1:50M-75M', '2:START-400000', '3-22']]
# keep=True retains only the listed intervals (keep=False would drop them instead)
mt = hl.filter_intervals(mt, intervals, keep=True)
mt = mt.annotate_cols(qc_metric = ...)
mt = mt.filter_cols(...)
That should extract just the samples of interest at the intervals of interest in time proportional to the size of the intervals and the number of samples.
@danking thanks for your quick reply. This function will be helpful at the stage of subsetting the locus intervals.
More specifically, I am looking for how to do the sample QC by call rate. I can do this on one pVCF in the UKB, but I am not sure how to compute the call rate across all chromosomes.
In the UKB, pVCFs are split into blocks, e.g. for chromosome 5 there are ukb23157_c5_b1_v1.vcf.gz, ukb23157_c5_b2_v1.vcf.gz, ukb23157_c5_b3_v1.vcf.gz, etc.
Can I generate a concatenated pVCF with all blocks for all chromosomes, then run the sample QC on it?
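One point that may help frame the question: a sample's overall missingness is its total number of missing genotypes divided by its total number of genotypes, so per-block counts can in principle be summed without first building one concatenated pVCF. Below is a minimal plain-Python sketch of that arithmetic only. The function names and the toy counts are hypothetical, and it assumes every block contains the same samples. In Hail, the per-block counts would presumably come from fields such as n_called and n_not_called added by hl.sample_qc; that wiring is an assumption and is not shown here.

```python
from collections import defaultdict


def combined_missingness(per_block_counts):
    """Sum per-block (n_called, n_total) counts into one missingness per sample.

    per_block_counts: iterable of dicts mapping sample ID -> (n_called, n_total),
    one dict per pVCF block (hypothetical input format).
    """
    called = defaultdict(int)
    total = defaultdict(int)
    for block in per_block_counts:
        for sample, (n_called, n_total) in block.items():
            called[sample] += n_called
            total[sample] += n_total
    # Missingness = missing genotypes / all genotypes, over all blocks combined.
    return {s: (total[s] - called[s]) / total[s] for s in total if total[s] > 0}


def samples_to_keep(missingness, threshold=0.10):
    """Keep samples whose missingness is strictly below the threshold."""
    return sorted(s for s, m in missingness.items() if m < threshold)


# Toy counts for two blocks (made up for illustration, not UKB data).
blocks = [
    {"S1": (98, 100), "S2": (80, 100), "S3": (90, 100)},
    {"S1": (45, 50), "S2": (50, 50), "S3": (45, 50)},
]
missingness = combined_missingness(blocks)
keep = samples_to_keep(missingness)
print(keep)
```

Note that averaging per-block call rates would give a different answer whenever blocks contain different numbers of variants, which is why the sketch sums counts rather than rates.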