I want to export all variants from a MatrixTable (MT) to a text file. (The format 1:100:A:C would be best, but I can accept 1:100 ["A","C"].)
I tried this, but it failed:
mt = hl.read_matrix_table(mt_path)
# select row fields: locus and alleles
mt2 = mt.select_rows(mt.locus, mt.alleles)
# convert the rows of the MatrixTable to a Table
rows_table = mt2.rows()
rows_table.export('gs://path/mt2.tsv')
hail.expr.expressions.base_expression.ExpressionException: 'MatrixTable.select_rows': cannot overwrite key field 'locus' with annotate, select or drop; use key_by to modify keys.
I also tried exporting the info field or the filters field with mt.info.export('gs://path/info.tsv'). That also gives me the locus and alleles columns, but the number of variants in the exported info or filters TSV does not match the number reported by hl.summarize_variants(mt).
With mt.info.export('info.tsv'), counting the lines of the TSV gives 461410 (including the header line).
With hl.summarize_variants(mt), I should have: Number of variants: 463841. Both use the same MT as input, so they should give the same number of variants. I am confused, because I only have biallelic variants.
select_rows is throwing an error here because key fields are implicitly included in any selection, so this code selects them twice. The error message could definitely be improved.
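A minimal sketch of what that means in practice, assuming the MT is keyed by locus and alleles: selecting with no arguments on the rows table keeps only the key fields, so they never need to be named. The function name export_row_keys and its parameters are hypothetical, not part of Hail.

```python
def export_row_keys(mt_path, out_path):
    """Export only the row key fields (assumed here: locus, alleles) as a TSV."""
    # Imported inside the function so that defining it does not start Hail.
    import hail as hl

    mt = hl.read_matrix_table(mt_path)
    # rows() returns a Table that carries the row key of the MatrixTable.
    rows = mt.rows()
    # select() with no arguments drops every non-key field; keys are always kept.
    rows = rows.select()
    rows.export(out_path)
```

Called as export_row_keys(mt_path, 'gs://path/mt2.tsv'), this should give the 1:100 ["A","C"] style of output, one locus column and one alleles column.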
You can get the chr:pos:ref:alt with hl.variant_str.
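As a rough sketch of how hl.variant_str could be used for this export, assuming the row key is locus and alleles: the string is added as a new field, the table is unkeyed so that locus and alleles become ordinary fields, and only the string column is written. The function name export_variant_strings, the field name variant, and the header=False choice are illustrative assumptions.

```python
def export_variant_strings(mt_path, out_path):
    """Write one chr:pos:ref:alt string per variant to a text file."""
    # Imported inside the function so that defining it does not start Hail.
    import hail as hl

    mt = hl.read_matrix_table(mt_path)
    rows = mt.rows()
    # Keep the keys plus a single new string field built from them.
    rows = rows.select(variant=hl.variant_str(rows.locus, rows.alleles))
    # Unkey the table so locus and alleles can be dropped from the output.
    rows = rows.key_by()
    rows = rows.select('variant')
    # header=False leaves one variant per line with no header line.
    rows.export(out_path, header=False)
```

Skipping the key_by() and the second select would instead export three columns: locus, alleles, and variant.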
Number of variants: 463841
==============================
Alleles per variant
-------------------
2 alleles: 463841 variants
==============================
But when I count the lines of the exported TSV with wc -l, I only get 460889 (including the header line).
(I only have biallelic SNVs in my MT, as shown in the hl.summarize_variants(mt) report.)
Why have some variants disappeared from the exported TSV?
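For reference, the comparison being made can be sketched in plain Python. This assumes the exported TSV has been copied to a local path first (the code does not read gs:// paths), and count_data_rows is a hypothetical helper name. The expected number could be taken from mt.count_rows().

```python
def count_data_rows(tsv_path, has_header=True):
    """Count the non-empty lines of a TSV, excluding the header if present."""
    with open(tsv_path, encoding='utf-8') as handle:
        n_lines = sum(1 for line in handle if line.strip())
    if has_header and n_lines > 0:
        return n_lines - 1
    return n_lines


def missing_rows(expected, tsv_path, has_header=True):
    """Return how many expected rows are absent from the exported file."""
    return expected - count_data_rows(tsv_path, has_header=has_header)
```

With the numbers above, 460889 lines including the header correspond to 460888 exported variants, which is 2953 fewer than the 463841 reported by hl.summarize_variants(mt).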
I tried it (with a slight change to the code, hl.init(spark_conf={'spark.speculation': 'false'}, default_reference='GRCh38'), because otherwise it reports a type error), and I get more variants in the TSV, but some are still missing.
OK, I think you’re hitting a bug for which we’re releasing a fix next week. I think you’re encountering preemptible machine failures interacting badly with GCS.
By the way, following your reply, I tried with only 2 normal nodes and no secondary nodes. I get almost all variants, but the exported TSV is still missing 4 variants compared with the MT.
I will wait for the new release. Thanks a lot for your help!