Problem with annotate

I have some issue when tryig to copy data from mt.variant_qc to mt.info

Context:

I have a matrix table create from VCF. I filter out some samples, then compute variant_qc. I wish to update the fields AC, AN, AF from variant_qc to info.

Code:

# Correct AC, AN, AF
mt = mt.annotate_rows(
    info = mt.info.annotate(
        AC = mt.variant_qc.AC,
        AN = mt.variant_qc.AN,
        AF = mt.variant_qc.AF
    )
)

Problem:
When I check the mt “in memory” I see the good information

mt.info.AC.show()
# +---------------+------------------------------+--------------+
# | locus         | alleles                      | <expr>       |
# +---------------+------------------------------+--------------+
# | locus<GRCh38> | array<str>                   | array<int32> |
# +---------------+------------------------------+--------------+
# | chr1:10128    | ["ACCCTAACCCTAACCCTAAC","A"] | [2578,2]     |

But when I write the mt and load it again, the fields are empty

# Write table
mt.write('s3://....genotypes.mt', overwrite=True)
# load full mt
mt_load = hl.read_matrix_table('s3://....genotypes.mt')
mt_load.info.AC.show()
# +---------------+------------------------------+--------------+
# | locus         | alleles                      | <expr>       |
# +---------------+------------------------------+--------------+
# | locus<GRCh38> | array<str>                   | array<int32> |
# +---------------+------------------------------+--------------+
# | chr1:10128    | ["ACCCTAACCCTAACCCTAAC","A"] | NA           |

I have no idea why I lost the data …
any help ?

Thanks

This is quite strange. What version are you using?

Isn’t the problem here just the nested annotate?

The problem is that after write/read, a row that was defined is now missing, I think.

I use Hail version 0.2.32-a5876a0a2853

Can you try updating and seeing if the error persists? If so, we’ll need to get some more information to debug, like the log file.

For now I found some workaround.

Some details on the context:
I had 2 matrix table obtained from different VCF. Then I use generate the union matrix table mt_union = mt_left.union(mt_right).
The resulting mt contain the info field of mt_left. the rows that are only present in mt_right have an info field empty. That is (not ideal but) expected behaviour of union().

Then I try to updates fields nested into info using
annotate(info = mt.info.annotate(...
And at that point my guess is that when mt.info is empty then the annotate function ignore the line. It is not explicitely stated for annotate() function but I remember seen mention of “NA rows are ignored” in some place in the doc

To wor around that I make a first pass when I force data into info for each rows
mt.annotate_rows( info = hl.struct(mode = 'union') )
Then I make a second pass to update multiple fields of info.
That way, there is no more empty info and all rows get updated.

That solve my imediate problem, but do not explain why in the previous post I am able to recover the data when I work “in memory” but not after a write/load of the mt.

And at that point my guess is that when mt.info is empty then the annotate function ignore the line. It is not explicitely stated for annotate() function but I remember seen mention of “NA rows are ignored” in some place in the doc

the mt.info.annotate is using StructExpression.annotate, which will return a missing value if the input is missing. In order to force a defined value, you can use hl.struct as follows:

mt = mt.annotate_rows(info = hl.struct(**mt.info, **annotation_fields))

This is the thing I’m concerned about. It has the smell of an optimizer error (where Hail’s optimizing compiler makes an invalid transformation of the pipeline to execute).