Hi, I’m looking for an efficient way to compute alternative sequences from range queries on genomic variants.
To do that, I need to fetch all the variants of a single individual within a given genomic region.
Is Hail suitable for this kind of query pattern?
I’d like to use this for tens of thousands of samples.
[EDIT for better explanation:]
For example, an individual might have an A->G mutation at position 2 and a deletion of three base pairs starting at position 4.
Reference sequence: TATCCCGGG
Alternative sequence in range [1…6]: TGTGGG
To calculate this alternative sequence, it is necessary to fetch for one individual all the variants in a certain genomic region.
Yes. Given some individual, I need to know all its variants in a certain region.
Variant == difference from the reference genome, i.e. if I know the reference genome plus the variants in the region, I can calculate the individual’s DNA sequence in that region. That’s the alternative sequence I’m trying to obtain for each individual.
So, in short, I need to execute range queries for single individuals. Does this make sense to you?
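To make the calculation concrete, here is a minimal sketch in plain Python (no Hail involved). The function name `alternative_sequence` and the `(position, ref, alt)` tuple layout are assumptions made for illustration. Positions are 1-based reference coordinates, and the deletion is written as `CCC` -> empty string to match the example above, whereas a real VCF would anchor it on the preceding base. Variants that are not fully inside the requested range, or that overlap an earlier variant, are simply skipped.

```python
from typing import List, Tuple

# Hypothetical variant record: (1-based position, ref allele, alt allele).
Variant = Tuple[int, str, str]


def alternative_sequence(reference: str, variants: List[Variant],
                         start: int, end: int) -> str:
    """Apply variants to reference[start..end] (1-based, inclusive)."""
    pieces = []
    cursor = start  # next reference position that has not been emitted yet
    for pos, ref, alt in sorted(variants):
        # Simplification: ignore variants outside the range or overlapping
        # a variant that was already applied.
        if pos < cursor or pos + len(ref) - 1 > end:
            continue
        if reference[pos - 1:pos - 1 + len(ref)] != ref:
            raise ValueError(f"ref allele mismatch at position {pos}")
        pieces.append(reference[cursor - 1:pos - 1])  # unchanged stretch
        pieces.append(alt)                            # substituted allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor - 1:end])          # tail after last variant
    return "".join(pieces)


REFERENCE = "TATCCCGGG"
VARIANTS = [(2, "A", "G"), (4, "CCC", "")]

# Applied over the whole 9-base reference, this yields the sequence from the example.
print(alternative_sequence(REFERENCE, VARIANTS, 1, 9))  # TGTGGG
```

The per-individual list of variants in the region is the only input that has to come from the variant store, which is why the question boils down to a single-sample range query.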
Yes, got it. Hail’s typical representation of VCF-like data stores a variant-major matrix, which does not make it easy to do rapid single-sample lookups. If you want to execute a query like this for all samples, Hail can probably do that efficiently.
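As a rough illustration of why the layout matters, here is a toy sketch in plain Python. It is not Hail's API or its storage format; `ROWS`, `SAMPLES` and `variants_per_sample` are hypothetical names. In a variant-major table each row is one variant and carries an entry for every sample, so a range query reads whole rows. Looking up one sample still touches every row in the range, while a single pass over those rows yields the variant lists for all samples at once.

```python
from typing import Dict, List, Tuple

SAMPLES = ["s1", "s2", "s3"]

# Toy variant-major table: (position, ref, alt, carriers), where carriers[i]
# says whether SAMPLES[i] carries the variant.
ROWS = [
    (2, "A", "G", [True, False, True]),
    (4, "CCC", "", [True, True, False]),
    (8, "G", "T", [False, True, False]),
]


def variants_per_sample(rows, samples: List[str], start: int,
                        end: int) -> Dict[str, List[Tuple[int, str, str]]]:
    """One pass over the rows in [start, end], grouped by sample."""
    per_sample: Dict[str, List[Tuple[int, str, str]]] = {s: [] for s in samples}
    for pos, ref, alt, carriers in rows:
        if not (start <= pos <= end):
            continue  # row lies outside the queried region
        # The whole row is read here, whichever samples we care about.
        for sample, carries in zip(samples, carriers):
            if carries:
                per_sample[sample].append((pos, ref, alt))
    return per_sample


result = variants_per_sample(ROWS, SAMPLES, 1, 6)
print(result["s1"])  # [(2, 'A', 'G'), (4, 'CCC', '')]
```

Under this model, the cost of the scan is the same whether one sample or all samples are requested, which is the sense in which the all-samples version of the query is the better fit.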