Joining Variant Datasets - Missed variant in inner join - Outer join


#1

The HAIL “join” function provides an inner join for two variant datasets. I would like to ask for a new feature that supports an outer join, so that all variant sites from both datasets are reported in the output.

Also, there are situations where merging variant records is difficult, especially at multiallelic sites. Based on my observations, HAIL ignores these situations and does not include such variant sites in the output. I would like to ask for a feature that correctly joins variants in those situations. Below I provide several examples showing different regions of FILE_1 and FILE_2, along with the corresponding HAIL join output. There are more examples at the link below.

https://drive.google.com/file/d/1jqZAnC4yWWzvOPwyjbr0Z7usfRotlwX3/view?usp=sharing

====================================================

Look for position 21:9418008

FILE_1 region=21:9417998-9418018
21 9418008 . A T . . . GT
21 9418012 . T G . . . GT
21 9418016 rs4087029 T G . . . GT

FILE_2 region=21:9417998-9418018
21 9418008 . A AT . . . GT

HAIL join output region=21:9417998-9418018
====================================================

Look for position 21:14385606

FILE_1 region=21:14385596-14385616
21 14385606 . C CTT,CTTT,A . . . GT
21 14385609 . TA T . . . GT
21 14385610 rs139914949 AT A,TT,*,ATT . . . GT

FILE_2 region=21:14385596-14385616
21 14385606 . C CTT,CTTT . . . GT
21 14385610 rs139914949 AT A,ATT,TT . . . GT

HAIL join output region=21:14385596-14385616
====================================================

Look for position 21:14387043

FILE_1 region=21:14387033-14387053
21 14387041 . C A . . . GT
21 14387043 . GA G,GAA . . . GT

FILE_2 region=21:14387033-14387053
21 14387043 . GA AA,G . . . GT

HAIL join output region=21:14387033-14387053
====================================================

Look for position 21:14391555

FILE_1 region=21:14391545-14391565
21 14391555 . AG A,GG . . . GT
21 14391556 rs115464252 G A,* . . . GT

FILE_2 region=21:14391545-14391565
21 14391555 . AG A . . . GT

HAIL join output region=21:14391545-14391565
====================================================

Look for position 21:14392484

FILE_1 region=21:14392474-14392494
21 14392480 . A G . . . GT
21 14392484 . A AC,G . . . GT
21 14392485 . A AC,C . . . GT

FILE_2 region=21:14392474-14392494
21 14392484 . A AC . . . GT
21 14392485 . A AC,C . . . GT

HAIL join output region=21:14392474-14392494
21 14392485 . A AC,C -10 . . GT:AD:DP:GQ:PL
====================================================

Look for position 21:14396727

FILE_1 region=21:14396717-14396737
21 14396727 . T C . . . GT

FILE_2 region=21:14396717-14396737
21 14396727 rs373212424 TG T . . . GT

HAIL join output region=21:14396717-14396737
====================================================


#2

Thanks for the post.

This is something that people have asked for before, but it’s certainly not a trivial problem. It’s pretty easy to write a join that joins by locus (chr, pos) and takes the union of all alleles at a position. However, what happens to the genotypes? What happens to the PL and AD fields? Can you sketch out what that would look like?
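To make the "easy" half concrete, here is a minimal sketch in plain Python of an outer join keyed by locus that takes the union of alleles. It is not Hail's API: the names `union_alleles` and `outer_join` and the dictionary layout are assumptions made up for illustration. When the two records have REF alleles of different lengths, the sketch pads the shorter representation onto the longer REF, which is one common way to put alleles on a shared reference. Genotypes, PL and AD are deliberately left out, since they are the open question.

```python
from typing import Dict, List, Tuple

Locus = Tuple[str, int]          # (chrom, pos)
Record = Tuple[str, List[str]]   # (REF, list of ALT alleles)


def union_alleles(a: Record, b: Record) -> Record:
    """Merge two records at the same locus onto one shared REF."""
    ref = max(a[0], b[0], key=len)  # the longest REF covers both records
    alts: List[str] = []
    for rec_ref, rec_alts in (a, b):
        if not ref.startswith(rec_ref):
            raise ValueError(f"incompatible REF alleles: {rec_ref} vs {ref}")
        suffix = ref[len(rec_ref):]  # bases to append so the alleles line up
        for alt in rec_alts:
            padded = alt if alt == "*" else alt + suffix
            if padded not in alts:   # keep first-seen order, skip duplicates
                alts.append(padded)
    return ref, alts


def outer_join(left: Dict[Locus, Record],
               right: Dict[Locus, Record]) -> Dict[Locus, Record]:
    """Keep every locus seen in either input (outer join by locus)."""
    merged: Dict[Locus, Record] = {}
    for locus in sorted(set(left) | set(right)):
        if locus in left and locus in right:
            merged[locus] = union_alleles(left[locus], right[locus])
        else:
            merged[locus] = left[locus] if locus in left else right[locus]
    return merged


# Records taken from the 21:14387043 example in the first post.
file_1 = {("21", 14387041): ("C", ["A"]),
          ("21", 14387043): ("GA", ["G", "GAA"])}
file_2 = {("21", 14387043): ("GA", ["AA", "G"])}

for locus, record in outer_join(file_1, file_2).items():
    print(locus, record)
```

With the 21:14396727 example, `union_alleles(("T", ["C"]), ("TG", ["T"]))` puts both records on REF `TG`, so the SNP `T>C` becomes `TG>CG` alongside the deletion `TG>T`.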


#3

Thanks for the quick reply.
Looking at the examples, it seems to me that the correct genotype for each sample could be computed after joining. However, this may require computing a new set of alleles for the ALT field (not simply concatenating them all), and all genotypes would have to be recomputed too. I agree that finding the correct genotypes after joining these sites requires complex logic. But considering that the genotype is the most important piece of information and is used in many analyses, it may be worth considering.
I am not very familiar with the statistical fields in VCF files (PL, AD). It may be that they cannot be recomputed after joining. In that case, they could be removed from the dataset, with a flag indicating the removal.
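As a rough sketch of what I have in mind, the plain-Python snippet below re-indexes a sample's GT against the merged allele list and drops the other fields when the allele list changed. It is not Hail code. The function names, the `fields_dropped` flag, and the AD/PL numbers are all made up for illustration, and it assumes both allele lists are already written against the same REF. PL cannot simply be carried over, because it holds one value per possible genotype and the merged site has more possible genotypes than the original one.

```python
from typing import Any, Dict, Optional, Sequence, Tuple

Genotype = Tuple[Optional[int], ...]  # None marks a missing call (".")


def remap_genotype(gt: Genotype,
                   old_alleles: Sequence[str],
                   new_alleles: Sequence[str]) -> Genotype:
    """Translate allele indices from old_alleles into new_alleles."""
    return tuple(
        None if i is None else new_alleles.index(old_alleles[i]) for i in gt
    )


def remap_entry(entry: Dict[str, Any],
                old_alleles: Sequence[str],
                new_alleles: Sequence[str]) -> Dict[str, Any]:
    """Remap GT; keep other fields only if the allele list is unchanged."""
    out: Dict[str, Any] = {
        "GT": remap_genotype(entry["GT"], old_alleles, new_alleles)
    }
    if list(old_alleles) == list(new_alleles):
        out.update({k: v for k, v in entry.items() if k != "GT"})
        out["fields_dropped"] = False
    else:
        # AD/PL are indexed by the old alleles, so flag them as removed.
        out["fields_dropped"] = True
    return out


# Allele lists (REF first) from the 21:14385610 example.
file_2_alleles = ["AT", "A", "ATT", "TT"]
merged_alleles = ["AT", "A", "TT", "*", "ATT"]

# Hypothetical FILE_2 sample called ATT/TT; AD and PL values are invented.
sample = {"GT": (2, 3),
          "AD": [0, 0, 12, 9],
          "PL": [90, 80, 85, 40, 45, 30, 42, 47, 0, 35]}

print(remap_entry(sample, file_2_alleles, merged_alleles))
# {'GT': (4, 2), 'fields_dropped': True}
```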

