Efficient fetching of alternative sequences?

Hi, I’m currently searching for an efficient way to fetch alternative sequences from range queries on genomic variants.

To do this, I need to fetch all the variants of one individual in a given genomic region.

Is Hail suitable for this kind of query pattern?
I’d like to use this for tens of thousands of samples.


[EDIT for better explanation:]
For example, an individual might have an A->G mutation at position 2 and a deletion of three base pairs at position 4.

Reference sequence: TATCCCGGG
Alternative sequence in range [1…6]: TGTGGG

To calculate this alternative sequence, it is necessary to fetch for one individual all the variants in a certain genomic region.

Sorry, you’re using some lingo I don’t totally understand:

“fetch alternative sequences”

Can you give an example of what you would want to get out of this query on VCF-like data?

For example, an individual might have an A->G mutation at position 2 and a deletion of three bases at position 4.

Reference sequence: TATCCCGGG
Alternative sequence in range [1…6]: TGTGGG

To calculate this alternative sequence, it is necessary to fetch for one individual all the variants in a certain genomic region.

Wouldn’t this be represented as two different non-overlapping variants, one at position 2, one at position 4?

Do you have a precise definition of an “alternative sequence”?

I still don’t see what you’d want to do with VCF-like data here, sorry.

Yes. Given some individual, I need to know all its variants in a certain region.

A variant is a difference from the reference genome, i.e. if I know the reference genome plus the variants in the region, I can calculate the individual’s DNA sequence in that region. That’s the alternative sequence I am trying to obtain for each individual.
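To make the definition concrete, here is a minimal Python sketch of that calculation. The function name `apply_variants` and the `(pos, ref, alt)` tuple layout are made up for illustration and are not part of Hail. It assumes a single haplotype, 1-based positions, non-overlapping variants, and a window given in reference coordinates.

```python
from typing import List, Tuple

# Hypothetical variant layout: (pos, ref, alt) with a 1-based reference position.
# A pure deletion can be written with alt == "" or VCF-style with an anchor base.
Variant = Tuple[int, str, str]


def apply_variants(reference: str, variants: List[Variant], start: int, end: int) -> str:
    """Return the individual's sequence for reference interval [start, end] (1-based, inclusive)."""
    out = []
    cursor = start  # next reference position that has not been emitted yet
    for pos, ref, alt in sorted(variants):
        if pos < start or pos + len(ref) - 1 > end:
            continue  # not fully inside the window; ignored in this sketch
        if pos < cursor:
            raise ValueError(f"overlapping variants at position {pos}")
        if reference[pos - 1:pos - 1 + len(ref)] != ref:
            raise ValueError(f"ref allele mismatch at position {pos}")
        out.append(reference[cursor - 1:pos - 1])  # unchanged bases before the variant
        out.append(alt)                            # alternate allele replaces ref allele
        cursor = pos + len(ref)
    out.append(reference[cursor - 1:end])          # unchanged tail of the window
    return "".join(out)


reference = "TATCCCGGG"
variants = [(2, "A", "G"), (4, "CCC", "")]
alt_full = apply_variants(reference, variants, 1, len(reference))
print(alt_full)
```

Applied to the whole reference from the example, this yields TGTGGG. Because the window is interpreted in reference coordinates here, asking for [1, 6] would return only the first three bases, since the deletion removes positions 4 to 6.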

So, in short, I need to execute range queries for single individuals. Does this make sense to you?

Yes, got it. Hail’s typical representation of VCF-like data stores a variant-major matrix, which does not make it easy to do rapid single-sample lookups. If you want to execute a query like this for all samples, Hail can probably do that efficiently.
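As a rough illustration of why the all-samples query fits a variant-major layout, here is a toy Python model. It is not Hail code and does not reflect Hail’s actual storage format. The table `rows`, the sample names, and the function `variants_per_sample` are invented for the example, and calls are haploid (0 = reference, 1 = alternate) for simplicity.

```python
from collections import defaultdict

# Toy variant-major table: one row per variant, one call per sample.
samples = ["s1", "s2", "s3"]
rows = [
    # (pos, ref, alt, calls in the same order as `samples`)
    (2, "A", "G", [1, 0, 1]),
    (4, "CCC", "", [1, 0, 0]),
    (8, "G", "T", [0, 1, 0]),
]


def variants_per_sample(rows, samples, start, end):
    """Collect the variants of every sample in [start, end] in one pass over the rows."""
    found = defaultdict(list)
    for pos, ref, alt, calls in rows:
        if start <= pos <= end:
            for sample, call in zip(samples, calls):
                if call == 1:  # sample carries the alternate allele
                    found[sample].append((pos, ref, alt))
    # Include samples without any variant in the region as empty lists.
    return {s: found[s] for s in samples}


per_sample = variants_per_sample(rows, samples, 1, 6)
print(per_sample)
```

A single scan over the rows in the region produces the variant list for every sample at once. A lookup for one sample has to read the same rows and discard all other columns, which is why single-sample queries are comparatively expensive in this kind of layout.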

Thanks for explaining. Yes, I’d like to do this for all samples in the end, but I’d need to get the results in batches.