Hi all,
I would like to analyze huge non-human VCF files and build my own non-human reference genome (a ReferenceGenome object), but I cannot find a way to set it up in Hail 0.2. As far as I can tell, there is no way to set it via hail.init() or through some other method such as a set_reference(). Could you please help? Thanks in advance!
This is one of the rougher edges of Hail. It’s possible to set the default reference genome to one of the built-in ones (GRCh37, GRCh38, GRCm38, CanFam3) in hl.init, but most methods that import genetic data (import_vcf, import_bed, …) or create locus objects (hl.locus, hl.parse_locus, hl.parse_locus_interval, …) take a reference_genome argument. See the import_vcf docs for one example.
So you can do something like:
hl.init() # use default ref genome
your_reference_genome = hl.ReferenceGenome(...)
mt = hl.import_vcf(..., reference_genome=your_reference_genome)
mt.write('...')
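In case it helps with the hl.ReferenceGenome(...) step: the constructor takes a name, a list of contig names, and a dict of contig lengths. Here is a minimal sketch, assuming you have a samtools-style .fai index for your FASTA (tab-separated, contig name in column 1 and length in column 2). The helper names contigs_from_fai and build_reference and the sample contigs are made up for illustration, not part of Hail.

```python
from typing import List, Tuple


def contigs_from_fai(fai_text: str) -> Tuple[List[str], List[int]]:
    """Parse the text of a .fai index into contig names and lengths."""
    contigs: List[str] = []
    lengths: List[int] = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        contigs.append(fields[0])       # column 1: contig name
        lengths.append(int(fields[1]))  # column 2: contig length in bases
    return contigs, lengths


def build_reference(name: str, fai_text: str):
    """Hypothetical helper: build a Hail ReferenceGenome from .fai text.

    Needs Hail installed and an initialized session (hl.init()).
    """
    import hail as hl  # imported lazily so the parser above works without Hail

    contigs, lengths = contigs_from_fai(fai_text)
    # ReferenceGenome expects the lengths as a dict keyed by contig name.
    return hl.ReferenceGenome(name, contigs, dict(zip(contigs, lengths)))


# Made-up two-contig index, just to show the parsing step.
sample_fai = "chr1\t1000\t6\t60\t61\nchr2\t500\t1030\t60\t61\n"
names, sizes = contigs_from_fai(sample_fai)
```

The object returned by build_reference is what you would pass as reference_genome= to import_vcf. The order of the contig list matters, since Hail uses it as the ordering of contigs in the genome.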
Once you have a MatrixTable or Table that has a Locus key parameterized by your reference genome, you don’t need to keep defining it with hl.ReferenceGenome in every Hail session. It will be loaded every time you call hl.read_matrix_table on the written matrix table. If you need to grab it again, you can do so with mt.locus.dtype.reference_genome.
Hi Tim,
Thank you very much for the quick reply! It would feel more intuitive to be able to set it up in the init method using a ReferenceGenome rather than somewhere else.
Yeah, agreed. This is non-trivial to implement because it’s a chicken-and-egg problem: we want to initialize Hail with a new reference genome, but we need functionality from an initialized Hail session in order to create the new reference genome in the first place.