Export variants to a tsv file


I want to export all variants from a MatrixTable (MT) to a text file. The format 1:100:A:C would be best, but 1:100 ["A","C"] is also acceptable.

I tried this but failed:

mt = hl.read_matrix_table(mt_path)
# select row fields: locus and alleles
mt2 = mt.select_rows(mt.locus, mt.alleles)
# convert the rows of the MatrixTable into a Table
rows_table = mt2.rows()

hail.expr.expressions.base_expression.ExpressionException: 'MatrixTable.select_rows': cannot overwrite key field 'locus' with annotate, select or drop; use key_by to modify keys.

I also tried exporting the info or filters field with mt.info.export('gs://path/info.tsv'), which also writes the locus and alleles columns. However, the number of variants in the exported info or filters TSV does not match the number reported by hl.summarize_variants(mt).

With mt.info.export('info.tsv'), the TSV has 461410 lines (including the header line).
With hl.summarize_variants(mt), I get: Number of variants: 463841. Both use the same MT as input, so they should report the same variants. I am confused, especially since my MT contains only biallelic variants.

Any help? Thanks a lot.
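To make the size of the gap explicit, here is a minimal plain-Python sketch. It assumes the 461410 figure is a raw line count that includes one header line, and that the TSV has been copied to a local path (open() does not read gs:// paths). The helper names count_variants_in_tsv and missing_count are hypothetical and are not part of Hail.

```python
def count_variants_in_tsv(path, has_header=True):
    """Count non-empty data lines in a local TSV, skipping the header if present."""
    with open(path, "r", encoding="utf-8") as handle:
        n_lines = sum(1 for line in handle if line.strip())
    return n_lines - 1 if has_header and n_lines > 0 else n_lines


def missing_count(expected_variants, exported_lines, has_header=True):
    """Variants absent from the export, given a raw line count such as from wc -l."""
    exported = exported_lines - 1 if has_header else exported_lines
    return expected_variants - exported


# Figures from the post: 461410 lines (assumed to include the header), 463841 expected.
print(missing_count(463841, 461410))  # 2432
```

Under that assumption the info export is short by 2432 variants, which is far too many to be an off-by-one from the header line.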

select_rows throws an error here because key fields are implicitly included in any selection, so listing mt.locus and mt.alleles selects them a second time. The error message could definitely be improved.

You can get the chr:pos:ref:alt with hl.variant_str.

Let’s try:

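ht = mt.rows()  # rows table; the keys (locus, alleles) come along automatically
ht = ht.key_by()  # drop the key so locus and alleles are not kept as separate columns
ht = ht.select(variant=hl.variant_str(ht.locus, ht.alleles))  # one chr:pos:ref:alt string per row
ht.export('gs://path/variants.tsv')  # placeholder output path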
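For checking exported lines outside Hail, the chr:pos:ref:alt layout can be mimicked in plain Python. This is only a stand-in sketch, not Hail's implementation: format_variant and parse_variant are hypothetical helper names, and joining several alternate alleles with commas is an assumption (the MT in this thread is biallelic, so that case does not arise here).

```python
def format_variant(contig, position, alleles):
    """Build 'contig:position:ref:alt' from a locus and its allele list."""
    ref, *alts = alleles
    return f"{contig}:{position}:{ref}:{','.join(alts)}"


def parse_variant(text):
    """Split 'contig:position:ref:alt' back into contig, position, and alleles.

    rsplit is used so that a contig name containing ':' stays intact.
    """
    contig, position, ref, alt = text.rsplit(":", 3)
    return contig, int(position), [ref] + alt.split(",")


print(format_variant("1", 100, ["A", "C"]))  # 1:100:A:C
print(parse_variant("chr1:100:A:C"))         # ('chr1', 100, ['A', 'C'])
```

Parsing each exported line this way is a cheap check that no line was truncated or merged during the export.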

Hi @tpoterba, thanks a lot for your answer! It works great: I get the variant list as chr:pos:ref:alt.
But I found this weird mismatch.
The code I ran:

ht = mt.rows()
ht = ht.key_by() # drop key
ht = ht.select(variant=hl.variant_str(ht.locus, ht.alleles))

from hl.summarize_variants(mt) I get:

Number of variants: 463841
Alleles per variant
  2 alleles: 463841 variants

But when I count the exported TSV with wc -l, I get only 460889 lines, including the header line.
(My MT contains only biallelic SNVs, as shown in the hl.summarize_variants(mt) report.)

Why did some variants disappear from the exported TSV?
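Knowing which variants are missing, and not just how many, can help narrow this down. Below is a minimal plain-Python sketch that compares a local copy of the exported TSV against a reference list of variant strings. It assumes the reference list is obtained separately, for example by collecting the variant field in the same Hail session, and that the file has one variant per line under a single header line. The helper names read_exported_variants and diff_variants are hypothetical.

```python
def read_exported_variants(path, has_header=True):
    """Read one variant string per non-empty line from a local TSV."""
    with open(path, "r", encoding="utf-8") as handle:
        lines = [line.rstrip("\r\n") for line in handle if line.strip()]
    return lines[1:] if has_header else lines


def diff_variants(expected, exported):
    """Return (missing, unexpected, n_duplicates) for two lists of variant strings."""
    expected_set = set(expected)
    exported_set = set(exported)
    missing = sorted(expected_set - exported_set)      # in the MT, not in the TSV
    unexpected = sorted(exported_set - expected_set)   # in the TSV, not in the MT
    n_duplicates = len(exported) - len(exported_set)   # repeated lines in the TSV
    return missing, unexpected, n_duplicates
```

If the missing variants cluster in contiguous position ranges, that would point toward whole output partitions being lost rather than individual rows being filtered.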

@shuang,

Are you using Spark Speculation? Spark Speculation interacts badly with HDFS to sometimes lose files.

You can disable it by specifying spark.speculation as false in your Spark Conf. If you’re using hl.init, try this:

import hail as hl
hl.init(spark_conf={'spark.speculation': 'false'})

EDIT: fix syntax error

Hi @danking, thanks for your reply.

I tried it with a slightly changed call, hl.init(spark_conf={'spark.speculation': 'false'}, default_reference='GRCh38'), because otherwise it reported a type error. I now get more variants in the TSV, but some are still missing.

The code I ran:

import hail as hl
hl.init(spark_conf={'spark.speculation': 'false'}, default_reference='GRCh38')
mt_path = 'gs://path/autosomal.mt'
mt = hl.read_matrix_table(mt_path)
ht = mt.rows()
ht = ht.key_by() # drop key
ht = ht.select(variant=hl.variant_str(ht.locus, ht.alleles))

I get 463795 variants, but according to hl.summarize_variants(mt) I should have 463841.

I ran this on GCP Dataproc, and I used this command to create the Hail cluster:

hailctl dataproc start hail \
    --master-machine-type n1-highmem-64 \
    --master-boot-disk-size 2000 \
    --num-workers 2 \
    --num-preemptible-workers 10 \
    --preemptible-worker-boot-disk-size 300 \
    --worker-machine-type n1-highmem-64 \
    --worker-boot-disk-size 300 \
    --num-worker-local-ssds 1 \
    --region europe-west1 \
    --zone europe-west1-b \
    --project myproject \
    --max-idle pt5m \
    --scopes cloud-platform

OK, I think you’re hitting a bug for which we’re releasing a fix next week. I think you’re encountering preemptible machine failure interacting badly with GCS.

Hi @danking, thanks a lot for your reply.

By the way, following your reply, I tried with only 2 regular nodes and no secondary (preemptible) nodes. I get almost all variants, but the exported TSV is still missing 4 variants compared with the MT.

I will wait for the new release. Thanks a lot for your help!