Annotate genomic intervals (not variants) with CADD scores


I would like to create a data structure in which I add CADD scores (or any other score) to each genomic position contained in an interval. This means that for each position I wouldn’t have one score, but rather three scores, one for each possible nucleotide substitution at that position.


interval<locus<GRCh37>>	str

What would be the best way to annotate / represent this data?
My CADD table looks like the following:

locus	ref	alt	raw_score	phred_score
locus<GRCh37>	str	str	float64	float64
1:22148737	"T"	"A"	1.45e+00	1.56e+01
1:22148737	"T"	"C"	1.47e+00	1.57e+01
1:22148737	"T"	"G"	1.45e+00	1.56e+01
1:22148738	"G"	"A"	1.30e+00	1.49e+01
1:22148738	"G"	"C"	1.25e+00	1.47e+01
1:22148738	"G"	"T"	1.26e+00	1.47e+01

In theory I could just use the “ref” field as my reference baseline, but that way I wouldn’t account for some possible positions that CADD misses.

I’m running Hail on my local cluster, so I think I can’t use the annotation_db, which is only available on Google Dataproc.
Any help?


you can take the cadd table, do group_by('locus').aggregate(values = hl.agg.collect(cadd.row.drop('locus'))) and then join with that
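For readers without a Hail session handy, here is a plain-Python sketch (not Hail) of the shape that aggregation produces, using a few rows from the CADD table above — one row per locus, with the per-substitution records collected into a list:

```python
from collections import defaultdict

# Rows from the CADD table above, as plain dicts.
cadd_rows = [
    {"locus": "1:22148737", "ref": "T", "alt": "A", "raw_score": 1.45, "phred_score": 15.6},
    {"locus": "1:22148737", "ref": "T", "alt": "C", "raw_score": 1.47, "phred_score": 15.7},
    {"locus": "1:22148737", "ref": "T", "alt": "G", "raw_score": 1.45, "phred_score": 15.6},
    {"locus": "1:22148738", "ref": "G", "alt": "A", "raw_score": 1.30, "phred_score": 14.9},
]

# Equivalent of group_by('locus').aggregate(values=hl.agg.collect(...)):
# group the rows by locus, dropping the locus field from each collected record.
grouped = defaultdict(list)
for row in cadd_rows:
    grouped[row["locus"]].append({k: v for k, v in row.items() if k != "locus"})
```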

thanks @tpoterba, very smooth and elegant way to represent cadd scores.

however, I can’t join the two tables, because the genome table is keyed by an Interval while the cadd table is keyed by a Locus.

ValueError: 'join': key mismatch:
    left:  [interval<locus<GRCh37>>]
    right: [locus<GRCh37>] 

I converted the CADD loci to intervals (there must be a better way to do that) and tried both join and annotate. Both take a long time and don’t output what I want.

cadd = cadd.annotate(
    interval=hl.locus_interval(cadd.locus.contig,
                               cadd.locus.position,
                               cadd.locus.position + 1)).key_by('interval')
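As a sanity check on that conversion, the same locus-to-interval logic can be sketched in plain Python (a hypothetical helper, mirroring hl.locus_interval's default inclusive-start / exclusive-end behavior):

```python
def locus_to_interval(locus_str):
    """'1:22148737' -> ('1', 22148737, 22148738): a length-1 interval,
    start inclusive, end exclusive (hl.locus_interval's defaults)."""
    contig, pos = locus_str.split(":")
    pos = int(pos)
    return (contig, pos, pos + 1)
```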

fasta_new = fasta.annotate(cadd=cadd[fasta.interval].values)

interval	cadd
interval<locus<GRCh37>>	array<struct{values: array<struct{values: array<struct{values: array<struct{values: array<struct{values: array<struct{values: array<struct{values: array<struct{ref: str, alt: str, raw_score: float64, phred_score: float64}>}>, interval: interval<str>}>, interval: interval<str>}>, interval: interval<locus<GRCh37>>}>}>, interval: interval<locus<GRCh37>>}>, interval: interval<locus<GRCh37>>}>
[1:22148737-1:22263803]	NA
[1:94458390-1:94586704]	NA
[1:97543299-1:98460556]	NA
[1:100652475-1:100715390]	NA

fasta_new = fasta.join(cadd)

A more intuitive approach, from a true beginner, would be to create one row per nucleotide and join by locus or interval, but I don’t know if this is efficient. Additionally, I would need to find a way to update the key (interval), adding 1 for each subsequent row. :weary:

fasta_new = fasta.explode(fasta.sequence)

interval	gene_name	locus_start	after	sequence
interval<locus<GRCh37>>	str	locus<GRCh37>	int32	str
[1:22148737-1:22263803]	"HSPG2"	1:22148737	115067	"T"
[1:22148737-1:22263803]	"HSPG2"	1:22148737	115067	"G"
[1:22148737-1:22263803]	"HSPG2"	1:22148737	115067	"G"
[1:22148737-1:22263803]	"HSPG2"	1:22148737	115067	"C"
[1:22148737-1:22263803]	"HSPG2"	1:22148737	115067	"G"
[1:22148737-1:22263803]	"HSPG2"	1:22148737	115067	"G"
[1:22148737-1:22263803]	"HSPG2"	1:22148737	115067	"G"

I’ve been somewhat stumped by this for a while. We have a lot of machinery for annotating loci/variants with intervals, but it’s not super easy to go the other way.

I think the right strategy is probably:

  1. annotate the CADD table with a list of all intervals that contain each site. You can do this with something like the following:
 # so that we can keep the interval key in the join
 intervals = intervals.annotate(interval_copy = intervals.interval)
 cadd = cadd.annotate(intervals = intervals.index(cadd.locus, all_matches=True).interval_copy)
  2. If the number of intervals is less than 250k or so, do the following (which will be more efficient but requires localizing the total result in memory):
aggregate_interval_stats = cadd.aggregate(
    hl.agg.explode(
        lambda interval: hl.agg.group_by(interval, <SUMMARIZATION FUNCTION HERE>),
        cadd.intervals))
  3. Otherwise, do something like:
cadd = cadd.explode(cadd.intervals, name='interval')
aggregate_interval_stats = cadd.group_by(cadd.interval).aggregate(<SUMMARIZATION HERE>)
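As a plain-Python illustration (not Hail) of step 1 — mapping each site to the interval that contains it — here is a sketch under the assumption of sorted, non-overlapping intervals on a single contig, with hypothetical coordinates taken from the tables above:

```python
import bisect

# Hypothetical non-overlapping intervals, sorted by start
# (start inclusive, end exclusive, matching Hail's default).
intervals = [(22148737, 22263803), (94458390, 94586704)]
starts = [s for s, _ in intervals]

def containing_interval(pos):
    """Return the interval containing pos, or None if no interval does
    (the single-match case, i.e. without all_matches=True)."""
    i = bisect.bisect_right(starts, pos) - 1
    if i >= 0 and intervals[i][0] <= pos < intervals[i][1]:
        return intervals[i]
    return None
```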

What exactly are you trying to compute per interval? We may be able to come up with a trick based on that.

I would eventually do computations over intervals. But for now, I just want to write some code in Hail to join different tables by interval and genomic locus. At the end I would have a big Hail table with as many annotations as I want per interval (or genomic locus, I'm not sure of the best way to represent this). I ended up doing it with pandas, since I was dealing with small intervals. Still, I needed to explode the fasta sequence and create a method to increment the genomic locus by one coordinate per row. Is there an easy way to do this in Hail? Like applying a function to specific columns? Anyway, this was the final df:

    locus	seq	cadd_score	phastcons_score
0	10:126085872	G	[['A', 0.201525, 5.864], ['C', 0.144673, 4.993...	0.028
1	10:126085873	A	[['C', -0.09779, 1.421], ['G', -0.075886, 1.64...	0.030
2	10:126085874	C	[['A', -0.0034590000000000003, 2.571], ['G', -...	0.014
3	10:126085875	A	[['C', 0.13156700000000002, 4.784], ['G', 0.15...	0.009
4	10:126085876	T	[['A', 0.281948, 6.976], ['C', 0.2988570000000...	0.014
5	10:126085877	T	[['A', -0.108543, 1.3230000000000002], ['C', -...	0.010
6	10:126085878	T	[['A', -0.003152, 2.575], ['C', 0.013758000000...	0.200
7	10:126085879	G	[['A', 0.217521, 6.097], ['C', 0.160669, 5.244...	0.300
8	10:126085880	A	[['C', 0.48106499999999996, 9.206], ['G', 0.50...	0.762
9	10:126085881	A	[['C', 0.542058, 9.764], ['G', 0.563963, 9.956...	0.811

In my case, I don’t expect to have intervals that overlap each other, so each site should overlap with either one interval or none. But indeed, this is useful when multiple overlapping intervals are under analysis (e.g. coordinates of sense and antisense transcripts).

What I would like is to avoid exploding the fasta sequence table and creating a large number of new rows to represent those intervals. Instead, if I have, say, variant sites belonging to an interval, it would be nice to have some mechanism to annotate those intervals with all the variants that belong there (in the same row). If that is not possible, a table where each row represents a single position should suffice.

Almost certainly, yes – Hail can do most of what Pandas can do. Do you have a specific example?

In this case, leave out the all_matches=True argument – this will return a single value (or missing if no interval contains that locus) and make things easier to process.

I agree, this would be a great addition. I don’t think we’ll see it in the next few months since it’s not trivial, but certainly this is a feature we should add (especially as we think about building out bedtools-like functionality).

The code is in this gist. The way I update fasta locus is in the process_poss function.

Thanks, will keep an eye on it.

here’s an example:

In [3]:
| locus         | base_str  |
| locus<GRCh37> | str       |
| 1:100         | "ATGGGCA" |

In [4]: ht = ht.annotate(range = hl.range(hl.len(ht.base_str)))

In [6]: ht = ht.explode(ht.range)

In [7]: ht = ht.transmute(locus=hl.locus(ht.locus.contig, ht.locus.position + ht.range), base=ht.base_str[ht.range])

In [8]:
| locus         | base |
| locus<GRCh37> | str  |
| 1:100         | "A"  |
| 1:101         | "T"  |
| 1:102         | "G"  |
| 1:103         | "G"  |
| 1:104         | "G"  |
| 1:105         | "C"  |
| 1:106         | "A"  |
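The explode-and-shift pattern above can be sketched in plain Python too (a hypothetical helper, not Hail): one (locus, base) row per character of the sequence, with the position incremented by the character's offset.

```python
def explode_sequence(contig, start, base_str):
    """Expand a sequence anchored at contig:start into per-base rows,
    mirroring the range/explode/transmute steps above."""
    return [(f"{contig}:{start + i}", base) for i, base in enumerate(base_str)]
```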

oh great!! thank you @tpoterba