Reference genomes in 0.2

Hail 0.1 did not work well with genetic datasets that
were not aligned to b37. The contig sort order was hard-coded as 1-22, X, Y, MT. Any additional contigs not in this list were sorted
alphabetically. This caused datasets with contigs that contained chr
prefixes such as hg38 to be shuffled on import. In addition, the sex
chromosomes and mitochondrial contig name were hardcoded to be X, Y,
and MT and the coordinates of the pseudoautosomal region on the X and Y
chromosomes were hardcoded for b37. Expr functions such as is_autosomal,
in_par_x, etc. used these hard-coded values.

In 0.2, we have fixed these problems with a number of changes explained in detail below.

ReferenceGenome

We have added a new Python class called ReferenceGenome which contains
information about the contig names, lengths, and order. It also stores
which contigs should be treated as X, Y, and mitochondrial
chromosomes and the locations of pseudoautosomal regions.

We have defined three built-in references in Hail: GRCh37, GRCh38, and default.
GRCh37 corresponds to 'b37' from GATK and is derived from the FASTA file available
from the GATK resource bundle. GRCh38 corresponds to 'hg38' from the GATK reference bundle.
default is an alias for GRCh37 currently, but will change to GRCh38 in the future.

To obtain the Python ReferenceGenome objects for the built-in references, use
hl.get_reference.

>>> grch37 = hl.get_reference('GRCh37')
>>> grch38 = hl.get_reference('GRCh38')

In addition, you can define your own reference by creating a new ReferenceGenome
object. For example, the following code defines a fake reference genome with three
contigs: a, b, and c with lengths equal to 50, 100, and 75.

>>> tiny_rg = ReferenceGenome('tiny', ['a', 'b', 'c'], {'a': 50, 'b': 100, 'c': 75})

The first argument defines the reference name (cannot be ‘GRCh37’, ‘GRCh38’, or ‘default’).
The second argument defines the contig names and their order. The third argument
is a map specifying the length of each contig.

To have Hail treat the contig ‘c’ analogous to a Y chromosome (males have one copy
and females have 0 copies), modify the command above as follows:

>>> tiny_rg = hl.ReferenceGenome('tiny', 
                              	 ['a', 'b', 'c'], 
                                 {'a': 50, 'b': 100, 'c': 75}, 
                                 y_contigs=['c'])

To save the newly defined reference for future use, export the data in JSON format using the
write method. This file can be read back into Hail with read. A reference genome can also be created from a FASTA file using the method from_fasta_file.

Be aware that in any one Hail session, there can only be one reference with a given name.

Locus

A locus in Hail 0.2 is parameterized by the reference it belongs to. This is
true of the type tlocus, the LocusExpression, and the local Python Locus object.
The locus contig must be in its reference genome and the position must be
in the range of 1-contig_length as defined by the reference genome. Otherwise,
Hail will throw an error.

The following are examples of locus expressions with coordinates consistent with the
reference genome:

>>> hl.locus('1', 4352435, reference_genome='GRCh37')
>>> hl.locus('chr1', 4352435, reference_genome='GRCh38')

Valid inputs for the argument reference_genome can be either a string
with the name of the reference genome or a Python ReferenceGenome object.

Methods on LocusExpression like is_autosomal and in_mito use its reference
genome to compute the result. For example, given a locus parameterized by tiny_rg
defined above, a locus on contig ‘c’ will not be autosomal:

>>> l = hl.locus('c', 23, reference_genome=tiny_rg)
>>> l.in_autosome()
False

Data Import

All of Hail’s data import methods that return a row key field with the locus
have an optional reference_genome argument:

The behavior of these methods is if reference_genome=None, the row field
locus will have the type tstruct(contig=tstr, position=tint32). If a
reference genome is passed (by reference object or by reference name), the
type of locus will be tlocus(reference_genome). For example, if
reference_genome='GRCh37' the type will be tlocus(GRCh37).

If reference_genome is defined, the locus contig and position will be
checked against the reference to ensure it is a valid locus (if not valid,
Hail will throw an exception). Therefore, to import data where the reference
is unknown or the contigs and positions are inconsistent with the target
reference, use reference_genome=None and then Hail expression functions
to modify the inputs (see below).

In addition, there is an optional contig_recoding argument that allows you
to rename contigs before they are checked against the reference genome specified.
This is useful when importing data where the contig names are slightly different
such as 01 instead of 1 for the UK Biobank data.

Importing locus intervals is analogous to importing datasets
except the type of interval when reference_genome is defined is
tinterval(tlocus(reference_genome)) and the type when reference_genome=None is tinterval(tstruct(contig=tstr, position=tint32)):

If importing a table that contains locus or locus intervals, you can specify
the type with the types parameter. For example, to import a table where the field locus has the reference genome ‘GRCh38’, I can do the following:

>>> t = hl.import_table('/my/file', types={'locus': hl.locus('GRCh38')})

Expression Language

We have added functions to the expression language that check whether coordinates
are consistent with a particular reference and construct locus and locus intervals.

The functions is_valid_contig and is_valid_locus test whether a contig and locus are consistent with a given reference genome.

>>> hl.is_valid_contig('1', 'GRCh37').value
True

>>> hl.is_valid_locus('chr1', 0, 'GRCh38').value
False

The functions locus, parse_locus, locus_interval, and parse_locus_interval allow you to construct a new locus
or locus interval. They will throw an error if the coordinates
given are not consistent with the reference genome given.

>>> hl.locus('chr1', 5000, reference_genome='GRCh38')
>>> hl.parse_locus('1:56783', reference_genome='GRCh37')

>>> hl.locus_interval('1', 501, 23424, reference_genome='GRCh37')
>>> hl.parse_locus_interval('chr22', reference_genome='GRCh38')

FASTA Sequence Lookup

Hail now allows users to get the reference sequence context at a given locus and liftover
locus or locus intervals to a different reference genome.

The first step is to add the reference sequence to the corresponding reference
genome from both a FASTA file and FASTA index file.

>>> grch37_fasta = 'gs://hail-common/references/human_g1k_v37.fasta.gz'
>>> grch37_fai = 'gs://hail-common/references/human_g1k_v37.fasta.fai'

>>> grch37.add_sequence(grch37_fasta, grch37_fai)

Now given a locus expression with type tlocus(GRCh37), we can get
the reference base at that position as well as bases before and after
the position:

>>> l = hl.locus('1', 1034245)

>>> l.sequence_context().value
"G"

>>> l.sequence_context(before=5).value
"ACTCGG"

If locus is of type tstruct(contig=tstr, position=tint32) such
as when importing a dataset with reference_genome=None, we can use the function get_sequence:

>>> l = hl.struct(contig='1', position=1034245)
>>> hl.get_sequence(l.contig, l.position, reference_genome='GRCh37').value
"G"

LiftOver

Hail now supports lifting over loci from one reference genome to another.
For example, to liftover a locus from ‘GRCh37’ to ‘GRCh38’, we first load the corresponding chain file and then use the liftover function to translate the coordinates:

>>> chain_file = 'gs://hail-common/references/grch37_to_grch38.over.chain.gz'
>>> grch37.add_liftover(chain_file, 'GRCh38')

>>> l = hl.locus('1', 1034245, 'GRCh37')
>>> hl.liftover(l, 'GRCh38').value
Locus(contig='chr1', position=1098865, reference_genome='GRCh38')

Modifying the Default Reference

If you want to change what 'default' refers to, you can pass either 'GRCh37' or 'GRCh38' to init.

hl.init(reference_genome='GRCh38')
2 Likes