Hl.vds.interval_coverage

Hello,
I have an issue running hl.vds.interval_coverage.

First of all, I am not sure how big the intevals in Table of intervals should be. Do you have some easy examle how to properly define intervals for interval_coverage analysis?

I tried hl.vds.interval_coverage on this interval: chr1:12345-chr1:17299.
But an error ocurred:
AttributeError Traceback (most recent call last) Cell In [23], line 1 ----> 1 hl.vds.interval_coverage(vds, gq_thresholds=(0, 10, 20), dp_thresholds=(0, 1, 10, 20, 30), intervals=table_intervals) File :2, in interval_coverage(vds, intervals, gq_thresholds, dp_thresholds, dp_field) File /usr/local/lib/python3.8/dist-packages/hail/typecheck/check.py:577, in _make_dec..wrapper(__original_func, *args, **kwargs) 574 @decorator 575 def wrapper(original_func, *args, **kwargs): 576 args, kwargs = check_all(__original_func, args, kwargs, checkers, is_method=is_method) → 577 return original_func(*args, **kwargs) File /usr/local/lib/python3.8/dist-packages/hail/vds/methods.py:835, in interval_coverage(vds, intervals, gq_thresholds, dp_thresholds, dp_field) 830 else: 831 dp_field_dict = dict() 833 per_interval = split.group_rows_by(interval=intervals[split.row_key[0]].interval_dup) \ 834 .aggregate( → 835 bases_over_gq_threshold=tuple( 836 hl.agg.filter(split.GQ >= gq_threshold, hl.agg.sum(ref_block_length)) for gq_threshold in 837 gq_thresholds), 838 **dp_field_dict 839 ) 841 interval = per_interval.interval 842 interval_size = interval.end.position + interval.includes_end - interval.start.position - 1 + interval.includes_start

120 return self.dict[item] → 122 raise AttributeError(get_nice_attr_error(self, item)) AttributeError: MatrixTable instance has no field, method, or property ‘GQ’ Hint: use ‘describe()’ to show the names of all data fields.

I tried to describe mine VariantDataset and the GQ is there.

This is when I tried to use show:

Have a good day,
Veronika

interval_coverage looks at the reference blocks, not variant calls, for information about quality per interval. What’s the schema of the reference data matrixtable?

vds.reference_data.describe()

Hi,
thank you for the clarification.

You are right. There is no GQ on the reference_data.:


Global fields:
None

Column fields:
‘s’: str

Row fields:
‘locus’: locus
‘ref_allele’: str

Entry fields:
‘END’: int32

Column key: [‘s’]
Row key: [‘locus’]


I was using combiner set like this:

combiner = hl.vds.new_combiner(
output_path=‘datasetGVCF.vds’,
temp_path=‘my-temp-bucketGVCF’,
gvcf_paths=gvcfs,
use_genome_default_intervals=True,
reference_genome=‘GRCh38’,
)

combiner.run()

Do you think it would help try to use GRCh37 (the default reference for the combiner) like is shown in the example on hail documentation - Hail | VariantDatasetCombiner? Do you think GRCh37 will have GQ unlike the GRCh38

Hello anyone,

is there any idea how to run vds combiner in a way to get Variant Dataset with reference_data part including GQ?

Have a nice day,
Veronika Bůžková

Can you paste the #FORMAT lines of the GVCF header and one data row for a reference block (with an END field in INFO)?

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description=“Represents any possible alternative allele at this location”>
##FORMAT=<ID=AD,Number=R,Type=Integer,Description=“Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed”>
##FORMAT=<ID=AF,Number=A,Type=Float,Description=“Allele fractions for alt alleles in the order listed”>
##FORMAT=<ID=DP,Number=1,Type=Integer,Description=“Approximate read depth (reads with MQ=255 or with bad mates are filtered)”>
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description=“Count of reads in F1R2 pair orientation supporting each allele”>
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description=“Count of reads in F2R1 pair orientation supporting each allele”>
##FORMAT=<ID=GOP,Number=2,Type=Float,Description=“Phred-scaled gap open penalty for insertions and deletions”>
##FORMAT=<ID=GP,Number=G,Type=Float,Description=“Phred-scaled posterior probabilities for genotypes as defined in the VCF specification”>
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=“Genotype Quality”>
##FORMAT=<ID=GT,Number=1,Type=String,Description=“Genotype”>
##FORMAT=<ID=ICNT,Number=.,Type=Integer,Description=“Counts of INDEL informative reads based on the reference confidence model”>
##FORMAT=<ID=MB,Number=4,Type=Integer,Description=“Per-sample component statistics to detect mate bias”>
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description=“Minimum DP observed within the GVCF block”>
##FORMAT=<ID=PL,Number=G,Type=Integer,Description=“Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification”>
##FORMAT=<ID=PRI,Number=G,Type=Float,Description=“Phred-scaled prior probabilities for genotypes”>
##FORMAT=<ID=PS,Number=1,Type=Integer,Description=“Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group”>
##FORMAT=<ID=SB,Number=4,Type=Integer,Description=“Per-sample component statistics which comprise the Fisher’s Exact Test to detect strand bias”>
##FORMAT=<ID=SPL,Number=.,Type=Integer,Description=“Normalized, Phred-scaled likelihoods for SNPs based on the reference confidence model”>
##FORMAT=<ID=SQ,Number=A,Type=Float,Description=“Somatic quality”

And reference INFO seems empty, am I right?