I have multiple VCF files and I want to store them in a single MatrixTable locally. I would like to use Hail as a database, where I can add new VCFs to my current dataset at any time, or load the dataset and compute my statistics.
Is this possible for multiple VCFs, and if so, how?
I am also currently dealing with the same problem: I have to combine multiple VCF files and store them in one MatrixTable… please let me know if you find a way to do this… I guess if it's one or two files, I can try the joins, but the problem is that I have around 200 VCF files. I can split them into groups, each with around 70 VCF files…
I am still stuck on this problem. It seems impossible with the current version of Hail.
I also have hundreds of VCFs, which I want to store in a single MatrixTable.
I am struggling even to join two VCFs. Calling mt1.union_rows(mt2) fails with: ValueError: 'MatrixTable.union_rows' expects all datasets to have the same columns. Datasets 0 and 1 have different columns (or possibly different order).
Did you manage to join two distinct VCFs? And how?
If all of those VCFs had the exact same variants, you could combine them with union_cols (columns are samples, so you want union_cols, not union_rows like you have above).
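A minimal sketch of that, assuming every input has the same variants and the same entry (FORMAT) fields. The helper names and any paths you pass in are hypothetical; only import_vcf, union_cols and write are Hail calls:

```python
from functools import reduce


def union_all_cols(mts):
    """Fold a list of MatrixTables into one, appending each one's samples as columns."""
    if not mts:
        raise ValueError("need at least one MatrixTable")
    # By default union_cols keeps only the variants (row keys) present in both inputs.
    return reduce(lambda left, right: left.union_cols(right), mts)


def combine_vcfs(vcf_paths, out_path):
    """Import each VCF, union the samples, and write one MatrixTable (hypothetical helper)."""
    import hail as hl  # imported lazily so union_all_cols itself has no Hail dependency

    mts = [hl.import_vcf(path) for path in vcf_paths]
    combined = union_all_cols(mts)
    combined.write(out_path, overwrite=True)
    return combined
```

As far as I know, union_cols also takes row_join_type='outer' to keep variants that appear in only some inputs (entries for the other samples come back missing), but check the docs for your Hail version before relying on that.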
Ok, but the VCF files are not standardised, and I cannot expect all of them to have exactly the same structure.
Is there a way to just add a dummy value like 'null' and collect only the fields of interest, like DP, GQ and GT?
The VCF combiner requires the same signature. I think if you make one VCF header file that has the union of all fields in all input GVCFs you can use that header for each file with header_file=...
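One way to build such a union header is sketched below in plain Python (no Hail involved). The function names are hypothetical, and it assumes that when two files define the same ID (say FORMAT/DP) differently, keeping the first definition seen is acceptable:

```python
import gzip
import re

# Matches structured meta-lines such as ##FORMAT=<ID=DP,...> and captures (kind, ID).
_ID_RE = re.compile(r"^##(\w+)=<ID=([^,>]+)")


def read_header(path):
    """Return the header lines (those starting with '#') of a plain or gzipped VCF."""
    opener = gzip.open if path.endswith((".gz", ".bgz")) else open
    lines = []
    with opener(path, "rt") as fh:
        for line in fh:
            if not line.startswith("#"):
                break  # first data record: header is over
            lines.append(line.rstrip("\n"))
    return lines


def merge_vcf_headers(headers):
    """Union the '##' meta-lines of several headers, keeping first-seen order."""
    seen = set()
    meta = []
    column_line = None
    for lines in headers:
        for line in lines:
            line = line.rstrip("\n")
            if line.startswith("##"):
                match = _ID_RE.match(line)
                # Deduplicate structured lines by (kind, ID), everything else by full text.
                key = (match.group(1), match.group(2)) if match else line
                if key not in seen:
                    seen.add(key)
                    meta.append(line)
            elif line.startswith("#CHROM") and column_line is None:
                column_line = line  # column line is taken from the first file only
    return meta + ([column_line] if column_line else [])
```

Write the merged lines to a file and pass it as header_file=... on import. Note that the #CHROM line (and so the sample names in it) comes from the first file only, so sample names may have to be supplied separately.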
Ok, I got the idea, but is there a way to select in the header which fields should be collected and just ignore the rest? I am concerned that in the future I might get a VCF file with a new field that my previous files do not have, and then I would have to rerun everything just because of this one field, which I do not even need…
I have a second question about run_combiner():
Does it only combine the given VCFs in a single run, so that the next run just overwrites the MatrixTable saved in out_file? Or can it also combine the given VCFs with the existing MatrixTable in out_file?
EDIT: I tried run_combiner, but it seems it can only combine single-sample VCF files once, and I cannot add more VCFs to the saved MatrixTable later. Is that correct?
Are you planning to add this feature in the future, so that we can store different VCFs together in a single MatrixTable that can be updated with new information (VCFs)?