Merge two datasets: ValueError: entry types differ

hhx037 · June 19, 2019, 2:44pm

Hi,

I’m trying to merge two datasets using the following command:

ALL_autosomes = set1_autosomes.union_cols(set2_autosomes)

but it throws an error as set1 only has the GP field, while set2 has GT, DS, and GP:

ValueError: entry types differ:
left: struct{GP: array}
right: struct{GT: call, ADS: array, DS: float64, GP: array}

Is it possible to tell Hail to merge on GP only?

tpoterba · June 19, 2019, 2:46pm

the thing to do for now is:

ALL_autosomes = set1_autosomes.union_cols(set2_autosomes.select_entries('GP')

We should expose unify and subset options on all the union methods, though.

hhx037 · June 19, 2019, 4:09pm

Oh, as simple as that! Thanks a lot

If I may ask a bonus question, is there a way to perform the variant_qc with GP, by applying some threshold for the GP value maybe, or does Hail requires the GT field to be set?

tpoterba · June 19, 2019, 5:12pm

perform the variant_qc with GP

The variant_qc function looks at the schema and generates a set of statistics using known fields (GT, DP, GQ, etc). But all its functionality can be expressed using annotate_rows and various aggregators.

What is the thing you’re trying to do with GP?

hhx037 · June 20, 2019, 8:49am

I’m mostly interested in AC (for major and minor allele), AN, n_het, het_freq_hwe, and p_value_hwe, which are all given by variant_qc, but I get ValueError: 'variant_qc': expect an entry field 'GT' of type 'call' I’m guessing because I only have the GP field.

tpoterba · June 20, 2019, 4:14pm

How would you define AC/AN when you have GP and not GT?

hhx037 · June 21, 2019, 8:24am

By using a threshold, for example considering the genotype with a GP over 0.5 as the GT, and NA if none of the genotypes reach 0.5

tpoterba · June 21, 2019, 10:02am

OK, the best thing to do then is to use annotate_entries to add a GT field computed from the GP using your definition. Then you can use variant_qc/sample_qc without any trouble.

hhx037 · June 21, 2019, 1:39pm

Thank you.

I’m struggling to see how to set GT, I get that call has no attribute ‘when’; I tried:

test = mt.annotate_entries(GT = hl.call().when(mt.GP[0] > 0.5, hl.call(0, 0, phased=False)).when(mt.GP[1] > 0.5, hl.call(0, 1, phased=False)).when(mt.GP[2] > 0.5, hl.call(1, 1, phased=False)).or_missing())

What should I use?

tpoterba · June 21, 2019, 2:11pm

hl.case()
    .when(mt.GP[0] > 0.5, hl.call(0, 0))
    .when(mt.GP[1] > 0.5, hl.call(0, 1))
    .default(hl.call(1, 1)))

hhx037 · June 21, 2019, 2:32pm

Oh, I see… I had not understood how hl.case() worked. Thanks a lot.

Topic		Replies	Views
Merge MTs different order of alleles and locus Hail Query & hailctl	4	488	July 16, 2020
Merging Allele and Variant Data Question Hail Query & hailctl	4	515	January 8, 2022
Aggregate GT over rows Hail Query & hailctl	2	206	January 9, 2024
'locus' (type 'locus<any>') Hail Query & hailctl	1	307	April 6, 2023
Getting the intersect between two genotype dataset Hail Query & hailctl	0	194	December 12, 2023

Merge two datasets: ValueError: entry types differ

Related topics