Merge two datasets: ValueError: entry types differ

Hi,

I’m trying to merge two datasets using the following command:

ALL_autosomes = set1_autosomes.union_cols(set2_autosomes)

but it throws an error as set1 only has the GP field, while set2 has GT, DS, and GP:

ValueError: entry types differ:
left: struct{GP: array}
right: struct{GT: call, ADS: array, DS: float64, GP: array}

Is it possible to tell Hail to merge on GP only?

the thing to do for now is:

ALL_autosomes = set1_autosomes.union_cols(set2_autosomes.select_entries('GP')

We should expose unify and subset options on all the union methods, though.

Oh, as simple as that! Thanks a lot :slight_smile:

If I may ask a bonus question, is there a way to perform the variant_qc with GP, by applying some threshold for the GP value maybe, or does Hail requires the GT field to be set?

perform the variant_qc with GP

The variant_qc function looks at the schema and generates a set of statistics using known fields (GT, DP, GQ, etc). But all its functionality can be expressed using annotate_rows and various aggregators.

What is the thing you’re trying to do with GP?

I’m mostly interested in AC (for major and minor allele), AN, n_het, het_freq_hwe, and p_value_hwe, which are all given by variant_qc, but I get ValueError: 'variant_qc': expect an entry field 'GT' of type 'call' I’m guessing because I only have the GP field.

How would you define AC/AN when you have GP and not GT?

By using a threshold, for example considering the genotype with a GP over 0.5 as the GT, and NA if none of the genotypes reach 0.5

OK, the best thing to do then is to use annotate_entries to add a GT field computed from the GP using your definition. Then you can use variant_qc/sample_qc without any trouble.

Thank you.

I’m struggling to see how to set GT, I get that call has no attribute ‘when’; I tried:

test = mt.annotate_entries(GT = hl.call().when(mt.GP[0] > 0.5, hl.call(0, 0, phased=False)).when(mt.GP[1] > 0.5, hl.call(0, 1, phased=False)).when(mt.GP[2] > 0.5, hl.call(1, 1, phased=False)).or_missing())

What should I use?

hl.case()
    .when(mt.GP[0] > 0.5, hl.call(0, 0))
    .when(mt.GP[1] > 0.5, hl.call(0, 1))
    .default(hl.call(1, 1)))

Oh, I see… I had not understood how hl.case() worked. Thanks a lot.