I have some issue when tryig to copy data from mt.variant_qc
to mt.info
Context:
I have a matrix table create from VCF. I filter out some samples, then compute variant_qc
. I wish to update the fields AC, AN, AF
from variant_qc
to info
.
Code:
# Correct AC, AN, AF
mt = mt.annotate_rows(
info = mt.info.annotate(
AC = mt.variant_qc.AC,
AN = mt.variant_qc.AN,
AF = mt.variant_qc.AF
)
)
Problem:
When I check the mt “in memory” I see the good information
mt.info.AC.show()
# +---------------+------------------------------+--------------+
# | locus | alleles | <expr> |
# +---------------+------------------------------+--------------+
# | locus<GRCh38> | array<str> | array<int32> |
# +---------------+------------------------------+--------------+
# | chr1:10128 | ["ACCCTAACCCTAACCCTAAC","A"] | [2578,2] |
But when I write the mt
and load it again, the fields are empty
# Write table
mt.write('s3://....genotypes.mt', overwrite=True)
# load full mt
mt_load = hl.read_matrix_table('s3://....genotypes.mt')
mt_load.info.AC.show()
# +---------------+------------------------------+--------------+
# | locus | alleles | <expr> |
# +---------------+------------------------------+--------------+
# | locus<GRCh38> | array<str> | array<int32> |
# +---------------+------------------------------+--------------+
# | chr1:10128 | ["ACCCTAACCCTAACCCTAAC","A"] | NA |
I have no idea why I lost the data …
any help ?
Thanks
This is quite strange. What version are you using?
Isn’t the problem here just the nested annotate?
The problem is that after write/read, a row that was defined is now missing, I think.
I use Hail version 0.2.32-a5876a0a2853
Can you try updating and seeing if the error persists? If so, we’ll need to get some more information to debug, like the log file.
For now I found some workaround.
Some details on the context:
I had 2 matrix table obtained from different VCF. Then I use generate the union matrix table mt_union = mt_left.union(mt_right)
.
The resulting mt contain the info field of mt_left
. the rows that are only present in mt_right
have an info field empty. That is (not ideal but) expected behaviour of union()
.
Then I try to updates fields nested into info using
annotate(info = mt.info.annotate(...
And at that point my guess is that when mt.info
is empty then the annotate function ignore the line. It is not explicitely stated for annotate()
function but I remember seen mention of “NA rows are ignored” in some place in the doc
To wor around that I make a first pass when I force data into info for each rows
mt.annotate_rows( info = hl.struct(mode = 'union') )
Then I make a second pass to update multiple fields of info.
That way, there is no more empty info and all rows get updated.
That solve my imediate problem, but do not explain why in the previous post I am able to recover the data when I work “in memory” but not after a write/load of the mt.
And at that point my guess is that when mt.info
is empty then the annotate function ignore the line. It is not explicitely stated for annotate()
function but I remember seen mention of “NA rows are ignored” in some place in the doc
the mt.info.annotate
is using StructExpression.annotate
, which will return a missing value if the input is missing. In order to force a defined value, you can use hl.struct
as follows:
mt = mt.annotate_rows(info = hl.struct(**mt.info, **annotation_fields))
This is the thing I’m concerned about. It has the smell of an optimizer error (where Hail’s optimizing compiler makes an invalid transformation of the pipeline to execute).