Hail 0.1 did not work well with genetic datasets that
were not aligned to b37. The contig sort order was hard-coded as 1-22, X, Y, MT
. Any additional contigs not in this list were sorted
alphabetically. This caused datasets with contigs that contained chr
prefixes such as hg38 to be shuffled on import. In addition, the sex
chromosomes and mitochondrial contig name were hardcoded to be X
, Y
,
and MT
and the coordinates of the pseudoautosomal region on the X
and Y
chromosomes were hardcoded for b37. Expr functions such as is_autosomal
,
in_par_x
, etc. used these hard-coded values.
In 0.2, we have fixed these problems with a number of changes explained in detail below.
ReferenceGenome
We have added a new Python class called ReferenceGenome
which contains
information about the contig names, lengths, and order. It also stores
which contigs should be treated as X
, Y
, and mitochondrial
chromosomes and the locations of pseudoautosomal regions.
We have defined three built-in references in Hail: GRCh37
, GRCh38
, and default
.
GRCh37
corresponds to 'b37'
from GATK and is derived from the FASTA file available
from the GATK resource bundle. GRCh38 corresponds to 'hg38'
from the GATK reference bundle.
default
is an alias for GRCh37
currently, but will change to GRCh38
in the future.
To obtain the Python ReferenceGenome
objects for the built-in references, use
hl.get_reference
.
>>> grch37 = hl.get_reference('GRCh37')
>>> grch38 = hl.get_reference('GRCh38')
In addition, you can define your own reference by creating a new ReferenceGenome
object. For example, the following code defines a fake reference genome with three
contigs: a
, b
, and c
with lengths equal to 50
, 100
, and 75
.
>>> tiny_rg = ReferenceGenome('tiny', ['a', 'b', 'c'], {'a': 50, 'b': 100, 'c': 75})
The first argument defines the reference name (cannot be ‘GRCh37’, ‘GRCh38’, or ‘default’).
The second argument defines the contig names and their order. The third argument
is a map specifying the length of each contig.
To have Hail treat the contig ‘c’ analogous to a Y
chromosome (males have one copy
and females have 0 copies), modify the command above as follows:
>>> tiny_rg = hl.ReferenceGenome('tiny',
['a', 'b', 'c'],
{'a': 50, 'b': 100, 'c': 75},
y_contigs=['c'])
To save the newly defined reference for future use, export the data in JSON format using the
write
method. This file can be read back into Hail with read
. A reference genome can also be created from a FASTA file using the method from_fasta_file
.
Be aware that in any one Hail session, there can only be one reference with a given name.
Locus
A locus in Hail 0.2 is parameterized by the reference it belongs to. This is
true of the type tlocus
, the LocusExpression
, and the local Python Locus
object.
The locus contig must be in its reference genome and the position must be
in the range of 1-contig_length
as defined by the reference genome. Otherwise,
Hail will throw an error.
The following are examples of locus expressions with coordinates consistent with the
reference genome:
>>> hl.locus('1', 4352435, reference_genome='GRCh37')
>>> hl.locus('chr1', 4352435, reference_genome='GRCh38')
Valid inputs for the argument reference_genome
can be either a string
with the name of the reference genome or a Python ReferenceGenome
object.
Methods on LocusExpression
like is_autosomal
and in_mito
use its reference
genome to compute the result. For example, given a locus parameterized by tiny_rg
defined above, a locus on contig ‘c’ will not be autosomal:
>>> l = hl.locus('c', 23, reference_genome=tiny_rg)
>>> l.in_autosome()
False
Data Import
All of Hail’s data import methods that return a row key field with the locus
have an optional reference_genome
argument:
The behavior of these methods is if reference_genome=None
, the row field
locus
will have the type tstruct(contig=tstr, position=tint32)
. If a
reference genome is passed (by reference object or by reference name), the
type of locus
will be tlocus(reference_genome)
. For example, if
reference_genome='GRCh37'
the type will be tlocus(GRCh37)
.
If reference_genome
is defined, the locus contig and position will be
checked against the reference to ensure it is a valid locus (if not valid,
Hail will throw an exception). Therefore, to import data where the reference
is unknown or the contigs and positions are inconsistent with the target
reference, use reference_genome=None
and then Hail expression functions
to modify the inputs (see below).
In addition, there is an optional contig_recoding
argument that allows you
to rename contigs before they are checked against the reference genome specified.
This is useful when importing data where the contig names are slightly different
such as 01
instead of 1
for the UK Biobank data.
Importing locus intervals is analogous to importing datasets
except the type of interval
when reference_genome
is defined is
tinterval(tlocus(reference_genome))
and the type when reference_genome=None
is tinterval(tstruct(contig=tstr, position=tint32))
:
If importing a table
that contains locus or locus intervals, you can specify
the type with the types
parameter. For example, to import a table where the field locus
has the reference genome ‘GRCh38’, I can do the following:
>>> t = hl.import_table('/my/file', types={'locus': hl.locus('GRCh38')})
Expression Language
We have added functions to the expression language that check whether coordinates
are consistent with a particular reference and construct locus and locus intervals.
The functions is_valid_contig
and is_valid_locus
test whether a contig and locus are consistent with a given reference genome.
>>> hl.is_valid_contig('1', 'GRCh37').value
True
>>> hl.is_valid_locus('chr1', 0, 'GRCh38').value
False
The functions locus
, parse_locus
, locus_interval
, and parse_locus_interval
allow you to construct a new locus
or locus interval. They will throw an error if the coordinates
given are not consistent with the reference genome given.
>>> hl.locus('chr1', 5000, reference_genome='GRCh38')
>>> hl.parse_locus('1:56783', reference_genome='GRCh37')
>>> hl.locus_interval('1', 501, 23424, reference_genome='GRCh37')
>>> hl.parse_locus_interval('chr22', reference_genome='GRCh38')
FASTA Sequence Lookup
Hail now allows users to get the reference sequence context at a given locus and liftover
locus or locus intervals to a different reference genome.
The first step is to add the reference sequence to the corresponding reference
genome from both a FASTA file and FASTA index file.
>>> grch37_fasta = 'gs://hail-common/references/human_g1k_v37.fasta.gz'
>>> grch37_fai = 'gs://hail-common/references/human_g1k_v37.fasta.fai'
>>> grch37.add_sequence(grch37_fasta, grch37_fai)
Now given a locus expression with type tlocus(GRCh37)
, we can get
the reference base at that position as well as bases before and after
the position:
>>> l = hl.locus('1', 1034245)
>>> l.sequence_context().value
"G"
>>> l.sequence_context(before=5).value
"ACTCGG"
If locus
is of type tstruct(contig=tstr, position=tint32)
such
as when importing a dataset with reference_genome=None
, we can use the function get_sequence
:
>>> l = hl.struct(contig='1', position=1034245)
>>> hl.get_sequence(l.contig, l.position, reference_genome='GRCh37').value
"G"
LiftOver
Hail now supports lifting over loci from one reference genome to another.
For example, to liftover a locus from ‘GRCh37’ to ‘GRCh38’, we first load the corresponding chain file and then use the liftover
function to translate the coordinates:
>>> chain_file = 'gs://hail-common/references/grch37_to_grch38.over.chain.gz'
>>> grch37.add_liftover(chain_file, 'GRCh38')
>>> l = hl.locus('1', 1034245, 'GRCh37')
>>> hl.liftover(l, 'GRCh38').value
Locus(contig='chr1', position=1098865, reference_genome='GRCh38')
Modifying the Default Reference
If you want to change what 'default'
refers to, you can pass either 'GRCh37'
or 'GRCh38'
to init
.
hl.init(reference_genome='GRCh38')