Minor changes to subsetting vcf

Hi,
I was wondering if it would be easy to make a couple of minor changes to a subsetted vcf.

  1. The “AN”, “AC”, and “AF” INFO fields are not updated to reflect the fact that there are now fewer samples in the subsetted vcf. Is this expected? Is there an easy Hail command to update them?

  2. Hail changes floats that were originally “0.000” to “0.00000e+00”. Is there a simple way to keep them in the original format?

Thanks,
Sam

Also, does export_vcf have an option to create an index file as well?

If you want to update the info fields, you need to update mt.info. Take a look at variant_qc, and at the warning in the export_vcf docs.
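As a rough sketch of that approach (not tested against your dataset): `variant_qc` adds a `variant_qc` row field whose `AC` and `AF` arrays include the reference allele at index 0, so the sketch drops that first element to match the per-alternate-allele convention of the VCF INFO fields. The helper name `recompute_allele_info` is made up for illustration, and it assumes your INFO fields are named `AC`, `AF`, and `AN` with the usual types.

```python
def recompute_allele_info(mt):
    """Overwrite info.AC, info.AF and info.AN using the samples still in `mt`.

    Hypothetical helper, not part of Hail. Assumes `mt` is a MatrixTable
    imported from a VCF whose INFO fields use the standard names and types.
    """
    # Imported here so that defining the helper does not require Hail to load.
    import hail as hl

    # Adds a `variant_qc` row field computed from the current columns only.
    mt = hl.variant_qc(mt)

    # variant_qc.AC and variant_qc.AF have one entry per allele, reference
    # first; VCF AC/AF are per alternate allele, so skip index 0.
    mt = mt.annotate_rows(
        info=mt.info.annotate(
            AC=mt.variant_qc.AC[1:],
            AF=mt.variant_qc.AF[1:],
            AN=mt.variant_qc.AN,
        )
    )
    # Drop the helper field so it is not carried around after the update.
    return mt.drop("variant_qc")
```

You would call this after the subsetting step (for example, after `filter_cols`) and pass the result to `export_vcf`.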

There’s no simple way to do this. You can use hl.format to create any format string you want, but I don’t know if that breaks the type annotations in a VCF (since now you have a string rather than a float). Either representation will compress well, so I don’t expect it to be a size issue. What is the issue you’re having with scientific notation?
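For what it’s worth, the `hl.format` route would look roughly like the sketch below. The helper name `format_info_float` and the field name you pass to it are placeholders, not anything from Hail or from your dataset, and the sketch assumes the field is a single float rather than an array. Note that the annotated field becomes a string expression, not a float.

```python
def format_info_float(mt, field, fmt="%.3f"):
    """Replace a scalar float INFO field with a fixed-format string.

    Hypothetical helper, not part of Hail. `field` is whatever scalar float
    INFO field you care about; the resulting field is str-typed.
    """
    # Imported here so that defining the helper does not require Hail to load.
    import hail as hl

    # hl.format builds a string expression from a format string and arguments.
    formatted = hl.format(fmt, mt.info[field])
    return mt.annotate_rows(info=mt.info.annotate(**{field: formatted}))
```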

Hail doesn’t have any facility to create VCF indices. Most folks run tabix after exporting.
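A minimal sketch of that post-export step, assuming `tabix` (from htslib) is on your PATH and that the VCF was written block-gzipped (for example with a `.bgz` extension). The function names are only illustrative.

```python
import subprocess


def tabix_command(vcf_path):
    """Build the tabix invocation for a block-gzipped VCF."""
    # -p vcf selects tabix's VCF preset for the sequence and position columns.
    return ["tabix", "-p", "vcf", vcf_path]


def index_vcf(vcf_path):
    """Run tabix on `vcf_path`; raises CalledProcessError if tabix fails."""
    subprocess.run(tabix_command(vcf_path), check=True)
    # By default tabix writes the index next to the input with a .tbi suffix.
    return vcf_path + ".tbi"
```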

We’ll definitely write a header line that says it’s a string, so that probably won’t do.

I think there’s an issue to expose a flexible float format, but it’s low-prio.

Sam,

  1. A note on Hail philosophy: in intention, Hail is more like a general-purpose data processing tool (like R or pandas) than it is like GATK. In particular, we process VCF files from lots of sources, as well as non-genotype data. While AN, etc. are relatively standard, we see lots of different schemas and didn’t feel we could reliably write update rules for all situations like this. Therefore, Hail operations generally do exactly what you ask and nothing more (e.g. filter_cols only filters columns and changes no other values), and we tried to build an interface that makes it very easy to update other parts of the dataset. You should be able to update AN, etc. with two lines: a call to variant_qc and then an annotate_rows. Let us know if you need more help with this.

  2. Changing the float format for VCF is a good suggestion. Can you open an issue on the Hail GitHub? https://github.com/hail-is/hail/issues

Also, we’ve had multiple requests to write an index with export_vcf, and it is on our todo list (but it is not scheduled, and we probably won’t get to it until after ASHG in October at the earliest).

Thank you for the info!

I have created a ticket on GitHub for the float format.