Hi, I have been using balding_nichols_model() to generate some matrices using the reference_genome argument, but I noticed that all the variants appear to be in chromosome 1. Is it possible to generate matrices on other chromosomes? In particular, I need to generate test files to use on some workflows that use several chromosomes, and require the variant coordinates to make sense depending on the chromosome.
Thanks!
Hi @Juan_Felipe_Ortiz,
balding_nichols_model
is implemented to simply generate loci from the chosen reference genome in order, starting from 1:1
. If the number of loci being generated is less than the number of loci in chromosome 1, then they will all be in chromosome 1.
If you want to generate a dataset with loci from multiple chromosomes, but don’t want to generate as many rows as would be needed to force it to span multiple chromosomes (which you probably don’t), I can think of two options you could try:
- Change the locus field afterwards to force it to skip over the genome with a large stride, e.g.
stride = 100_000
mt = hl.balding_nichols_model(3, 100, 100_000)
mt = mt.key_rows_by(locus=hl.locus_from_global_position(mt.locus.global_position() * stride, reference_genome='GRCh38'))
(Note that in the Balding-Nichols model, the variants are all independent, so we can freely modify the genetic positions of the variants without affecting the model in any way.)
- If your test doesn’t depend on what reference genome is used, you could use your own reference with smaller contigs, e.g.
contigs = ["1", "2", "3", "4"]
lengths = {"1": 1_000, "2": 1_000, "3": 1_000, "4": 1_000}
my_ref = hl.ReferenceGenome("my_ref", contigs, lengths)
mt = hl.balding_nichols_model(3, 100, 3_000, reference_genome=my_ref)
Hope that helps!
Hi, Thanks for pointing me on the right direction.
I tried the first approach, but I was getting out of bounds of the genome, so I what I did was to calculate stride so that the variants fill the genome uniformily
n_variants = 100_000
# I need this for something else
# Use only the autosomes
chrom_lens = {f"chr{x}": hl.get_reference("GRCh38").contig_length(f"chr{x}") for x in range(1,23)}
genome_len = sum(x for x in chrom_lens.values())
stride = int(genome_len / (n_variants - 1)) # -1 just to make sure the last variant is in-bounds
mt = hl.balding_nicols_model(3, 100, n_variants, reference_genome="GRCh38")
mt = mt.key_rows_by(locus=hl.locus_from_global_position(mt.locus.global_position() * stride, reference_genome="GRCh38")
Thanks a lot!