Store multiple VCFs in a single MatrixTable


I have multiple VCF files and want to store them in a single MatrixTable locally. I would like to use Hail as a database, so that at any time I can add new VCFs to my current dataset, or load the dataset and compute statistics.
Is this possible with multiple VCFs, and if so, how?

Thank you in advance!


Hi @nonchev,

I am currently dealing with a similar problem: I have to combine multiple VCF files and store them in one MatrixTable. Please let me know if you find a way to do this. If it were only one or two files I could try joins, but I have around 200 VCF files. I can split them into groups of roughly 70 VCF files each.

I am still stuck on this problem. It seems impossible with the current version of Hail.
I also have hundreds of VCFs that I want to store in a single MatrixTable.
I am struggling even to join two VCFs; I get this error:
ValueError: 'MatrixTable.union_rows' expects all datasets to have the same columns. Datasets 0 and 1 have different columns (or possibly different order).
Did you manage to join two distinct VCFs? If so, how?

Are these VCFs all for the same individuals, but just contain different segments of the genome? Or are they all different individuals?

different individuals

If all of those VCFs had the exact same variants, you could combine them with union_cols (columns are samples, so you want union_cols, not union_rows like you have above).
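
A minimal sketch of the union_cols route, assuming a recent Hail 0.2 release and block-gzipped inputs (hence force_bgz=True). The helper name union_samples and its arguments are hypothetical. Keeping only GT, DP and GQ gives every dataset the same entry schema, which union_cols requires. The default row_join_type="inner" keeps only variants present in every file, while "outer" also keeps variants that are absent from some files and leaves those entries missing. It has not been tested at the scale of hundreds of files, so treat it as a starting point.

```python
def union_samples(vcf_paths, fields=("GT", "DP", "GQ"), row_join_type="inner"):
    """Import each VCF, keep only `fields` per genotype, and join the samples.

    Hypothetical helper, not part of Hail. Every input must contain all of
    `fields` in its FORMAT column and use the same reference genome.
    """
    if not vcf_paths:
        raise ValueError("need at least one VCF path")

    # Imported here so the sketch can be loaded even where Hail is not installed.
    import hail as hl

    combined = None
    for path in vcf_paths:
        # Assumption: inputs are block-gzipped and match Hail's default
        # reference genome; pass reference_genome=... to import_vcf otherwise.
        mt = hl.import_vcf(path, force_bgz=True)
        # Drop every entry field except the ones of interest so that the
        # entry schemas are identical across files.
        mt = mt.select_entries(*fields)
        if combined is None:
            combined = mt
        else:
            combined = combined.union_cols(mt, row_join_type=row_join_type)
    return combined
```

Usage would look like `union_samples(["a.vcf.gz", "b.vcf.gz"], row_join_type="outer").write("combined.mt")`, where the file names are placeholders. Note that rows are matched on locus and alleles, so the same site written with different alleles in two files will not be merged.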

Otherwise, we have the VCF Combiner, though that requires you to actually have a bunch of single-sample GVCFs.

No, the variants are not necessarily the same in every VCF file…

I saw the documentation of run_combiner and also tried it myself, but I am getting this exception:

Hail version: 0.2.57-582b2e31b8bd
Error summary: HailException: invalid genotype signature: expected signatures to be identical for all inputs.
/path/104250.vcf.gz: +PCStruct{AD:PCArray[PInt32],DP:PInt32,GQ:PInt32,GT:PCCall,PGT:PCCall,PID:PCString,PL:PCArray[PInt32],RGQ:PInt32,SB:PCArray[PInt32]}
/path/118008.vcf.gz: +PCStruct{AD:PCArray[PInt32],DP:PInt32,F1R2:PCArray[PInt32],F2R1:PCArray[PInt32],GQ:PInt32,GT:PCCall,PGT:PCCall,PID:PCString,PL:PCArray[PInt32],PS:PInt32,RGQ:PInt32,SB:PCArray[PInt32]}

I am not sure how to fix this… When I use import_vcf on each VCF individually, everything works.

If you look at your VCFs, they have different genotype signatures. It looks like one has an F1R2 field (along with F2R1 and PS) and the other doesn't.

Ok, but the VCF files are not standardised, and I cannot expect all of them to have exactly the same structure.
Is there a way to add a dummy value like ‘null’ for missing fields and collect only the fields of interest, such as DP, GQ and GT?

The VCF combiner requires the same signature. I think if you make one VCF header file that has the union of all fields in all input GVCFs you can use that header for each file with header_file=...
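
One way to build such a union header is sketched below in plain Python. The function name union_header_lines is hypothetical, and the sketch only merges header text; whether the combiner then accepts every input with the merged header is the suggestion above, not something this code checks. Structured lines such as FORMAT or INFO are kept once per ID, and the first definition seen wins.

```python
import re

# Matches structured meta lines such as ##FORMAT=<ID=DP,Number=1,...>
_STRUCTURED = re.compile(r"^##(?P<kind>[A-Za-z]+)=<ID=(?P<id>[^,>]+)")


def union_header_lines(headers):
    """Merge the '##' meta lines of several VCF headers.

    `headers` is a list of headers, each given as a list of header lines.
    Structured lines are kept once per (kind, ID); other '##' lines are
    deduplicated by exact text. The '#CHROM' column line is taken from the
    first header that has one and is placed last.
    """
    seen = set()
    merged = []
    column_line = None
    for lines in headers:
        for line in lines:
            line = line.rstrip("\n")
            if line.startswith("##"):
                match = _STRUCTURED.match(line)
                key = (match["kind"], match["id"]) if match else line
                if key not in seen:
                    seen.add(key)
                    merged.append(line)
            elif line.startswith("#CHROM") and column_line is None:
                column_line = line
    if column_line is not None:
        merged.append(column_line)
    return merged
```

The header lines of each file can be collected by reading until the first line that does not start with "#", and the merged result written to a single text file to pass as the shared header.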

Ok, I get the idea, but is there a way to select in the header which fields should be collected and ignore the rest? I am concerned that in the future I might get a VCF file with a new field that is not in my previous files, and then I would have to rerun everything just because of this one field, which I do not even need…

I have a second question about run_combiner():
Does it only combine the given VCFs in a single run, so that the next run simply overwrites the matrix saved in out_file, or can it also combine the given VCFs with the existing matrix in out_file?
EDIT: I tried run_combiner, but it seems it can only combine single-sample VCF files once, and I cannot add more VCFs to the saved MatrixTable later. Is that correct?

Are you planning to add this feature in the future, so that we can store different VCFs together in a single MatrixTable that can be updated with new VCFs?