Missing "END" field from parsed .g.vcf files

Hi folks,

I have loaded about 7,000 .g.vcf files into a VDS using hl.vds.new_combiner().

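For context, the combine step looked roughly like this (the paths, interval settings, and gVCF list are placeholders, not my exact values):

import hail as hl

hl.init(default_reference='GRCh38')

# placeholder list standing in for my ~7,000 gVCF paths
gvcf_paths = [
    'gs://my-bucket/gvcfs/sample1.g.vcf.gz',
    'gs://my-bucket/gvcfs/sample2.g.vcf.gz',
]

combiner = hl.vds.new_combiner(
    output_path='gs://my-bucket/combined.vds',   # placeholder
    temp_path='gs://my-bucket/tmp/',             # placeholder
    gvcf_paths=gvcf_paths,
    use_genome_default_intervals=True,
    reference_genome='GRCh38',
)
combiner.run()

vds = hl.vds.read_vds('gs://my-bucket/combined.vds')
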
I’m hoping to follow much of the gnomAD process to create a public dataset from these samples, but the first script from the gnomAD docs, create_last_END_positions.py, tries to run mt.select_entries("END"), and there is no END field available in my VDS. My starting .g.vcf files do contain END, but I don’t see it when I run describe() on my large VDS (output below, followed by a minimal reproduction of the failing call):

VDS.variant_data.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
----------------------------------------
Entry fields:
    'LA': array<int32>
    'LGT': call
    'LAD': array<int32>
    'LPGT': call
    'LPL': array<int32>
    'RGQ': int32
    'gvcf_info': struct {
        BaseQRankSum: float64, 
        ExcessHet: float64, 
        InbreedingCoeff: float64, 
        MLEAC: array<int32>, 
        MLEAF: array<float64>, 
        MQRankSum: float64, 
        RAW_MQandDP: array<int32>, 
        ReadPosRankSum: float64
    }
    'DP': int32
    'GP': array<float64>
    'GQ': int32
    'MIN_DP': int32
    'PG': array<float64>
    'PID': str
    'PS': int32
    'SB': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------
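
For completeness, this is roughly the call from create_last_END_positions.py that fails for me (path is a placeholder):

import hail as hl

vds = hl.vds.read_vds('gs://my-bucket/combined.vds')   # placeholder path
mt = vds.variant_data

# This fails, since 'END' is not among the entry fields listed above:
mt = mt.select_entries('END')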

Maybe END is there and I’m just not accessing it properly? Alternatively, the gnomAD scripts suggest this step only helps in downstream steps, so maybe it is unnecessary for my data?

Thanks for any help!

Try vds.reference_data.describe()

The VDS splits reference and variant data into two separate MatrixTables, both for efficiency (it is better for storage and for compute) and for interface reasons.
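
For example, something like this should show it (path is a placeholder; this assumes your combiner version stores END, rather than a block length, on the reference blocks):

import hail as hl

vds = hl.vds.read_vds('gs://my-bucket/combined.vds')   # placeholder path

# Reference blocks carry END; variant_data does not.
vds.reference_data.describe()

# Peek at a few reference-block END values:
vds.reference_data.END.show(5)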