Help with import vcf and write

Hi! I’m new to Hail and I’m running it on a cluster as a Singularity image. The problem I’m encountering comes up when I try to import a VCF file.
I run the following commands in IPython:

import hail as hl
hl.import_vcf('chr22.recal.vcf.gz', force=True)

When I run this command, I get the following message:
<hail.matrixtable.MatrixTable at 0x7f28428a9208>

and I’m not sure where Hail saved the MatrixTable.

Next, I also ran the following command:
hl.import_vcf('chr22.recal.vcf.gz', force=True).write('', overwrite=False)

What I get is this:
2020-08-18 20:04:14 Hail: WARN: file 'file:chr22.recal.vcf.gz' is 122.9G
It will be loaded serially (on one core) due to usage of the 'force' argument.
If it is actually block-gzipped, either rename to .bgz or use the 'force_bgz'
[Stage 0:> (0 + 1) / 1]

And the file doesn’t get saved.

Do you have any insights as to how to get around this problem?

Thank you!

Hi Maria,

What’s happening with your first question is that the file isn’t being saved at all, because you haven’t asked Hail to save it yet. hl.import_vcf returns a MatrixTable Python object, which has a write method. So you could do something like:

mt = hl.import_vcf('chr22.recal.vcf.gz', force=True)
mt.write('', overwrite=False)

and that would write out your VCF as a MatrixTable at the path you give to write. That’s what the

hl.import_vcf('chr22.recal.vcf.gz', force=True).write('', overwrite=False)

line you posted is doing.

For the second question, it depends on how your file is compressed. If it was compressed with bgzip, pass force_bgz=True instead of force=True so the file can be read in parallel. If it wasn’t compressed with bgzip, I don’t know of a solution: it will have to be read serially, which will be slow. Once you have written it out as a MatrixTable, though, you can use that in the future by reading it back with mt = hl.read_matrix_table("").
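If you aren’t sure which case applies, one rough way to check is to look at the first bytes of the file. bgzip output is ordinary gzip whose header carries an extra "BC" subfield (this is the BGZF layout described in the SAM/BAM specification). The sketch below uses only the Python standard library; looks_like_bgzf is a made-up helper name, not part of Hail, and it only inspects the first block.

```python
import struct


def looks_like_bgzf(path):
    """Return True if the file starts with a BGZF (bgzip) block header."""
    with open(path, "rb") as f:
        header = f.read(12)
        if len(header) < 12:
            return False
        # gzip magic bytes + deflate method, and the FEXTRA flag (0x04) set
        if header[:3] != b"\x1f\x8b\x08" or not (header[3] & 0x04):
            return False
        # XLEN: little-endian length of the extra field that follows
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = f.read(xlen)
    # Walk the extra subfields looking for SI1='B' (66), SI2='C' (67), SLEN=2
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (66, 67, 2):
            return True
        pos += 4 + slen
    return False
```

If this returns True for chr22.recal.vcf.gz, force_bgz=True is probably the right flag; if it returns False, the file is most likely plain gzip and force=True is the fallback.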

Hi John,

Thank you so much for your prompt reply.
I typed:

I got the message:
<hail.matrixtable.MatrixTable at 0x7fd811c11b70>

And then I typed:

But then I got the message:
NameError: name ‘mt’ is not defined

Do you know why that is?

Thanks again!

You have to bind the result of hl.import_vcf('chr22.recal.vcf.gz', force_bgz=True) to a variable, like this:

mt = hl.import_vcf('chr22.recal.vcf.gz', force_bgz=True)

That line defines mt, so it can be used in the mt.write call on the next line.
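Putting the steps from this thread together, a minimal sketch might look like the following. The helper names import_options and vcf_to_matrix_table and the output path in the usage comment are made up for illustration; the only Hail calls used are hl.import_vcf, the MatrixTable write method, and hl.read_matrix_table.

```python
def import_options(block_gzipped):
    """Pick the import_vcf flag for a .gz file, as discussed above."""
    return {"force_bgz": True} if block_gzipped else {"force": True}


def vcf_to_matrix_table(vcf_path, out_path, block_gzipped=True):
    """Import a VCF, save it as a MatrixTable, and read the saved copy back."""
    import hail as hl  # imported here so import_options works without Hail

    # import_vcf returns a MatrixTable object; nothing is saved to disk yet.
    mt = hl.import_vcf(vcf_path, **import_options(block_gzipped))
    # Binding the result to mt is what makes mt.write possible.
    mt.write(out_path, overwrite=False)
    # Later sessions can read the saved MatrixTable instead of the VCF.
    return hl.read_matrix_table(out_path)


# Usage (the output path 'chr22.recal.mt' is a hypothetical example):
# mt = vcf_to_matrix_table('chr22.recal.vcf.gz', 'chr22.recal.mt')
```

Typing the import on its own in IPython only prints the object and throws it away, which is why the later mt.write raised a NameError.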

It worked!!! woohoo!!! Thank you!!!