Annotating a VCF from the gnomAD VCF, and merge-joining tables in general

Hi,

I would like to annotate an SV VCF with data from the gnomAD SV (v4) VCF.

I’m having trouble figuring out how to join the two datasets in Hail. For example, for each variant, I want to find all the variants in gnomAD SV that have the same chr and SVTYPE and a 90% overlap between start and end POS. Of the overlaps found, I’d then like to pick the one with the highest AF.

The Hail table join tutorials only show simple joins where both datasets share the same key. In my scenario, the “join key” is more like a merge function that needs to be evaluated for each pair of intervals, one from each dataset. In general, is this possible in Hail? An example would be much appreciated.

Tommy.

Hi Tommy,

Hail doesn’t support joining with arbitrary join predicates. We do support joins where one dataset is keyed by intervals, and two rows are joined if the key from one side is contained in the interval key of the other. But it sounds like you would need joins where both sides are keyed by intervals. That’s something we’ve wanted to add, but never had enough demand for the feature, and we don’t have the bandwidth to implement it in the near term.

I’ve never heard of anybody wanting to treat variants as intervals, only genes. Are the two datasets using different reference genomes?

Hi Patrick,

The two datasets (both on GRCh38) would be a typical structural variant VCF and gnomAD v4 SV, so the join would have intervals on both sides, matched on 90% overlap (or an arbitrary overlap threshold) plus the same SVTYPE. Conceptually, in SQL, it would look like:

-- Join on 90% overlap of either interval; start is an alias for pos; assume both tables have end columns
select *
from d1
  join d2 on d1.chr = d2.chr
    and d1.svtype = d2.svtype
    and d1.start < d2.end and d1.end > d2.start  -- the intervals actually overlap
    and (  -- the overlap covers at least 90% of either interval
      LEAST(d1.end, d2.end) - GREATEST(d1.start, d2.start) >= 0.9 * (d1.end - d1.start)
      OR LEAST(d1.end, d2.end) - GREATEST(d1.start, d2.start) >= 0.9 * (d2.end - d2.start)
    )
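
To make the predicate concrete outside SQL, here is a minimal plain-Python sketch of the same logic, including the "pick the highest AF" step. It is not Hail code and not a proposed solution: it is a brute-force comparison of every pair, which would not scale to a real callset. The `SV` record, the function names and the example coordinates are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SV:
    """Hypothetical minimal SV record; field names are assumptions."""
    chrom: str
    start: int
    end: int
    svtype: str
    af: float = 0.0  # allele frequency, only meaningful for the reference set


def matches(a: SV, b: SV, min_frac: float = 0.9) -> bool:
    """Same chrom and svtype, and the overlap covers at least
    min_frac of either interval (mirrors the SQL predicate)."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        # No overlap; zero-length records (e.g. insertions) also end up here.
        return False
    return (overlap >= min_frac * (a.end - a.start)
            or overlap >= min_frac * (b.end - b.start))


def best_match(query: SV, reference: Iterable[SV],
               min_frac: float = 0.9) -> Optional[SV]:
    """Return the matching reference SV with the highest AF, or None."""
    hits = [r for r in reference if matches(query, r, min_frac)]
    return max(hits, key=lambda r: r.af, default=None)


# Made-up example data, standing in for the gnomAD SV side.
reference = [
    SV("chr1", 1000, 2000, "DEL", af=0.01),  # identical interval
    SV("chr1", 1050, 2000, "DEL", af=0.20),  # 950 bp overlap, passes
    SV("chr1", 1000, 2000, "DUP", af=0.50),  # wrong SVTYPE
    SV("chr1", 1900, 5000, "DEL", af=0.90),  # only 100 bp overlap
]
query = SV("chr1", 1000, 2000, "DEL")
best = best_match(query, reference)
```

With this data, two reference records pass the predicate and `best` is the one with AF 0.20. What I'm after is the Hail equivalent of `best_match` applied to every row of my VCF.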

Hey @patrick-schultz, so I take it the conceptual query I provided is not that common in Hail? Or at least that it is not idiomatic and requires a workaround? If so, what would the workaround look like for arbitrary join predicates?