[Hail 0.2] Merge two MatrixTable

Hello.

I’m new to Hail and trying to find a way to merge two MatrixTables.

I have two MatrixTables (each created with the import_vcf function).
They don’t have overlapping samples, but do have overlapping variants.

Is there a way to merge two MatrixTables in Hail 0.2?

Hail 0.1 seems to have a way to join two VariantDataset, but I’d rather stay with Hail 0.2…

Awaiting any suggestions…

Thank you.

is https://hail.is/docs/devel/hail.MatrixTable.html#hail.MatrixTable.union_cols what you’re looking for?

VariantDataset.join (0.1) and MatrixTable.union_cols (0.2) are identical.

According to the union_cols() documentation, only rows common to both datasets will be kept…
What I want is to have a MatrixTable with union of rows & union of cols…
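
To make the difference concrete, here is a toy pure-Python sketch (not Hail code). It assumes a "matrix" is just a dict of `{variant: {sample: genotype}}`; the helper names `union_cols_like` and `outer_join_like` are made up for illustration and only mimic the row behaviour described above.

```python
# Toy model only: a "matrix" is {variant: {sample: genotype}}.
# These helpers are hypothetical and are NOT part of Hail's API.

def samples_of(matrix):
    """Return the sorted set of sample IDs seen in any row."""
    return sorted({s for row in matrix.values() for s in row})

def union_cols_like(left, right):
    """Keep only variants present in BOTH inputs; combine their samples."""
    return {v: {**left[v], **right[v]} for v in left if v in right}

def outer_join_like(left, right):
    """Keep every variant and every sample; absent genotypes become None."""
    samples = sorted(set(samples_of(left)) | set(samples_of(right)))
    merged = {}
    for variant in sorted(set(left) | set(right)):
        calls = {**left.get(variant, {}), **right.get(variant, {})}
        merged[variant] = {s: calls.get(s) for s in samples}
    return merged

mt1 = {"1:100:A:T": {"s1": "0/1"}, "1:200:G:C": {"s1": "1/1"}}
mt2 = {"1:200:G:C": {"s2": "0/1"}, "1:300:T:G": {"s2": "0/0"}}

inner = union_cols_like(mt1, mt2)   # rows: intersection, cols: union
outer = outer_join_like(mt1, mt2)   # rows: union, cols: union

print(sorted(inner))
print(sorted(outer))
```

The first result drops the variants seen in only one input, which is the behaviour I would like to avoid; the second keeps them and leaves the genotype empty for samples that were not in that input.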

There is no way to do this in either 0.1 or 0.2 currently. It’s coming soon.

Could you describe what the result should look like? What should happen for genotypes that appear in both inputs, or in only one?

The result I’m looking for would be:

  • The resulting MatrixTable should have all genotypes that appear in any of the input MatrixTables.
    (Ideally without duplicates, but it’s okay if they are not deduplicated.)
  • The resulting MatrixTable should have all samples that appear in any of the input MatrixTables.
    (Again, hopefully without duplicates, but I can deduplicate samples later.)

Does this make sense?

Thank you.

This sounds like an outer join on rows and columns, which we’re working on and should appear in the next 4-8 weeks I think.

Has there been any progress on a Hail merge feature for VCFs or matrix tables?

I am looking for functionality similar to bcftools merge. SV programs run on single samples, so we have 5000 VCFs, one for each sample and SV pipeline, that we would like to merge together. We would like every variant site and every sample to be in the resulting VCF/MatrixTable. bcftools merge also has options for how to set the FILTER field and missing genotypes, which would be useful too.
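
As a rough illustration of the shape of result I mean, here is a small pure-Python sketch (neither bcftools nor Hail). It assumes each single-sample call set is a dict of `{site: genotype}`; the function name `merge_single_sample_calls` and its `missing` parameter are hypothetical, and the `missing` value only loosely imitates choosing how absent genotypes are written.

```python
# Toy sketch: merge many single-sample call sets into one site x sample table.
# `merge_single_sample_calls` and `missing` are hypothetical names.

def merge_single_sample_calls(calls_by_sample, missing="./."):
    """calls_by_sample: {sample: {site: genotype}} -> {site: {sample: genotype}}.

    Every site seen in any sample is kept, and every sample gets an entry
    at every site; entries the sample did not report are set to `missing`.
    """
    samples = sorted(calls_by_sample)
    sites = sorted({site for calls in calls_by_sample.values() for site in calls})
    return {
        site: {s: calls_by_sample[s].get(site, missing) for s in samples}
        for site in sites
    }

# Sites are (contig, position, alt) tuples; the values are made up.
calls = {
    "sampleA": {("chr1", 1000, "<DEL>"): "0/1"},
    "sampleB": {("chr1", 1000, "<DEL>"): "1/1", ("chr2", 500, "<DUP>"): "0/1"},
}

merged_default = merge_single_sample_calls(calls)
merged_as_ref = merge_single_sample_calls(calls, missing="0/0")

print(merged_default[("chr2", 500, "<DUP>")])
print(merged_as_ref[("chr2", 500, "<DUP>")])
```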

We do have the aforementioned outer join function hl.experimental.outer_join_mt (which I’ve just now realized doesn’t appear in the docs; I’ll fix that).

We have been working on building a scalable joint calling / genotype gVCFs algorithm in Hail, which is more in line with what you’re doing, I’m guessing.

What do the SV VCFs look like?

Thanks! I found this in the documentation: hail.experimental.full_outer_join_mt. Is that the function that I should be using?

The SV programs generate a standard VCF file for each sample, so there is no gVCF-like file as in the GATK pipelines. There are SV specs in VCF 4.2 (https://samtools.github.io/hts-specs/VCFv4.2.pdf). The per-sample SV sites are then compiled into one comprehensive list, and that site list is genotyped in each sample individually from that sample’s CRAM. This step adds the homozygous-reference 0/0 calls with some quality scores. The individual VCFs are then merged.

How the ALT is specified varies between programs. Some will use symbolic alleles and other abbreviations specified in the VCF spec to minimize the space required to include the sequence.

You could certainly implement something with that function, but I think there’s probably no easy way to join 5000 VCFs in this manner right now – it’s essentially a data transpose.

It’s possible that doing a hierarchical full_outer_join_mt will perform adequately, as long as it’s broken up into ~2 pipelines: about 13 levels of pairwise merges are required to get to 5k inputs (log2(5000) ≈ 12.3), so doing 6-7 levels in each pipeline should work.
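
A minimal sketch of that level-by-level scheme in plain Python, to check the arithmetic. The `merge_levels` helper and its `max_levels` parameter are hypothetical, and the `merge` callback is only a stand-in: here it unions sets of sample IDs, whereas in Hail it would be something like a full_outer_join_mt call plus whatever reconciles the left/right fields, which is not shown.

```python
import math

def merge_levels(items, merge, max_levels=None):
    """Pairwise-merge `items` one level at a time.

    Returns (remaining_items, levels_done). With max_levels=None it runs
    until a single item is left; otherwise it stops after that many levels.
    """
    level = list(items)
    done = 0
    while len(level) > 1 and (max_levels is None or done < max_levels):
        # Merge neighbours (0,1), (2,3), ...; an odd leftover is carried up.
        merged = [merge(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            merged.append(level[-1])
        level = merged
        done += 1
    return level, done

# Stand-in inputs: one set of sample IDs per "VCF".
inputs = [{f"s{i}"} for i in range(5000)]

def union(a, b):
    return a | b

# First pipeline: 7 levels, after which intermediates would be written out.
stage1, levels1 = merge_levels(inputs, union, max_levels=7)
# Second pipeline: finish from the intermediates.
stage2, levels2 = merge_levels(stage1, union)

print(len(stage1), levels1, levels2, math.ceil(math.log2(5000)))
```

With 5000 inputs the first stage leaves 40 intermediates after 7 levels and the second stage needs 6 more, which matches the 13 levels estimated from log2(5000).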