Gnomad_lof_metrics annotations and filtering data

Hi! I have annotated my data with the gnomad_lof_metrics database, but I have trouble filtering on those annotations, mainly because the annotation is a dictionary with a gene as the key and an array as the value (if I understand correctly). I am able to get information on specific genes with

chr.gnomad_lof_metrics.get('genenamehere').pLI.show()

However, I would like to filter my data based on those gnomAD annotations. For example, I would like to filter out genes that have pLI < 0.90. What would be the best way of doing this?
Here is the structure of the gnomad_lof_metrics annotation:

chr.gnomad_lof_metrics.describe()


Type:
dict<str, array<struct {
transcript: str,
obs_mis: int32,
exp_mis: float64,
oe_mis: float64,
mu_mis: float64,
possible_mis: int32,
obs_mis_pphen: int32,
exp_mis_pphen: float64,
oe_mis_pphen: float64,
possible_mis_pphen: int32,
obs_syn: int32,
exp_syn: float64,
oe_syn: float64,
mu_syn: float64,
possible_syn: int32,
obs_lof: int32,
mu_lof: float64,
possible_lof: int32,
exp_lof: float64,
pLI: float64,
pNull: float64,
pRec: float64,
oe_lof: float64,
oe_syn_lower: float64,
oe_syn_upper: float64,
oe_mis_lower: float64,
oe_mis_upper: float64,
oe_lof_lower: float64,
oe_lof_upper: float64,
constraint_flag: str,
syn_z: float64,
mis_z: float64,
lof_z: float64,
oe_lof_upper_rank: int32,
oe_lof_upper_bin: int32,
oe_lof_upper_bin_6: int32,
n_sites: int32,
classic_caf: float64,
max_af: float64,
no_lofs: int32,
obs_het_lof: int32,
obs_hom_lof: int32,
defined: int32,
p: float64,
exp_hom_lof: float64,
classic_caf_afr: float64,
classic_caf_amr: float64,
classic_caf_asj: float64,
classic_caf_eas: float64,
classic_caf_fin: float64,
classic_caf_nfe: float64,
classic_caf_oth: float64,
classic_caf_sas: float64,
p_afr: float64,
p_amr: float64,
p_asj: float64,
p_eas: float64,
p_fin: float64,
p_nfe: float64,
p_oth: float64,
p_sas: float64,
transcript_type: str,
gene_id: str,
transcript_level: int32,
cds_length: int32,
num_coding_exons: int32,
gene_type: str,
gene_length: int32,
exac_pLI: float64,
exac_obs_lof: int32,
exac_exp_lof: float64,
exac_oe_lof: float64,
brain_expression: str,
chromosome: str,
start_position: int32,
end_position: int32
}>>
Source:
<hail.matrixtable.MatrixTable object>
Index:
['row']

Thanks!

You have to decide how to combine the pLI information across (potentially) many overlapping genes and many transcripts in each gene. If you just want to remove variants where at least one transcript in at least one gene has a pLI < 0.90:

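all_transcripts = hl.flatten(mt.gnomad_lof_metrics.values())
transcripts_pLI_status = all_transcripts.map(lambda t: t.pLI < 0.9)
# keep=False drops the rows where the condition is true (filter_rows keeps them by default)
mt = mt.filter_rows(hl.any(transcripts_pLI_status), keep=False)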

transcripts_pLI_status is an array of booleans. One boolean for each transcript. The value is True when pLI for the transcript in question is less than 0.9. hl.any is true if any element of its argument (which must be an array or set) is true.
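If it helps to see the shape of the computation, here is the same flatten / map / any logic as a plain-Python sketch. This is not Hail code: the dict of lists stands in for one row's gnomad_lof_metrics value, and the variant names, gene names, transcript IDs, pLI values and the has_low_pli helper are all made up for illustration.

```python
# Plain-Python stand-in for the per-row annotation:
# gene name -> list of per-transcript records (only pLI is shown).
# Every name and number below is hypothetical example data.
rows = {
    "variant_1": {"GENE_A": [{"transcript": "T1", "pLI": 0.99},
                             {"transcript": "T2", "pLI": 0.95}]},
    "variant_2": {"GENE_A": [{"transcript": "T1", "pLI": 0.99}],
                  "GENE_B": [{"transcript": "T3", "pLI": 0.10}]},
    "variant_3": {},  # no overlapping genes
}


def has_low_pli(metrics, threshold=0.9):
    """True if any transcript of any gene has pLI below the threshold."""
    # Counterpart of hl.flatten(...values()): one flat list of transcripts.
    all_transcripts = [t for transcripts in metrics.values() for t in transcripts]
    # Counterpart of .map(lambda t: t.pLI < 0.9) followed by any.
    return any(t["pLI"] < threshold for t in all_transcripts)


# Drop every variant where at least one transcript is below the threshold.
kept = [name for name, metrics in rows.items() if not has_low_pli(metrics)]
print(kept)
```

Here variant_2 is dropped because of the single low-pLI transcript in GENE_B, even though GENE_A passes. In this plain-Python version variant_3 is kept, because any() over an empty list is False. If you would rather combine the transcripts differently (for example, require every transcript to fail, or look at the maximum pLI), only the last line of has_low_pli needs to change.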

Thank you for the reply! However, I still get an error:

all_transcripts = hl.flatten(mt.gnomad_lof_metrics.values())
transcripts_pLI_status = all_transcripts.map(lambda t: t.pLI < 0.9)
mt = mt.filter_rows(hl.any(transcripts_pLI_status))

Traceback (most recent call last):
File "", line 1, in
TypeError: any() missing 1 required positional argument: 'collection'

Heh. Looks like Hail’s any doesn’t follow the Python standard API. I’ll fix that today.

In the meantime, change hl.any(transcripts_pLI_status) to hl.any(lambda x: x, transcripts_pLI_status).

Thank you so much, it worked! :)