Use balding_nichols_model() to generate variants in multiple chromosomes

Hi, I have been using balding_nichols_model() to generate some matrices using the reference_genome argument, but I noticed that all the variants appear to be in chromosome 1. Is it possible to generate matrices on other chromosomes? In particular, I need to generate test files to use on some workflows that use several chromosomes, and require the variant coordinates to make sense depending on the chromosome.

Thanks!

Hi @Juan_Felipe_Ortiz,

balding_nichols_model is implemented to simply generate loci from the chosen reference genome in order, starting from 1:1. If the number of loci being generated is less than the number of loci in chromosome 1, then they will all be in chromosome 1.

If you want to generate a dataset with loci from multiple chromosomes, but don’t want to generate as many rows as would be needed to force it to span multiple chromosomes (which you probably don’t), I can think of two options you could try:

  • Change the locus field afterwards to force it to skip over the genome with a large stride, e.g.
    stride = 100_000
    mt = hl.balding_nichols_model(3, 100, 100_000)
    mt = mt.key_rows_by(locus=hl.locus_from_global_position(mt.locus.global_position() * stride, reference_genome='GRCh38'))
    
    (Note that in the Balding-Nichols model, the variants are all independent, so we can freely modify the genetic positions of the variants without affecting the model in any way.)
  • If your test doesn’t depend on what reference genome is used, you could use your own reference with smaller contigs, e.g.
    contigs = ["1", "2", "3", "4"]
    lengths = {"1": 1_000, "2": 1_000, "3": 1_000, "4": 1_000}
    my_ref = hl.ReferenceGenome("my_ref", contigs, lengths)
    
    mt = hl.balding_nichols_model(3, 100, 3_000, reference_genome=my_ref)
    

Hope that helps!

Hi, Thanks for pointing me on the right direction.
I tried the first approach, but I was getting out of bounds of the genome, so I what I did was to calculate stride so that the variants fill the genome uniformily

n_variants = 100_000
# I need this for something else
# Use only the autosomes
chrom_lens = {f"chr{x}": hl.get_reference("GRCh38").contig_length(f"chr{x}") for x in range(1,23)}
genome_len = sum(x for x in chrom_lens.values())

stride = int(genome_len / (n_variants - 1)) # -1 just to make sure the last variant is in-bounds

mt = hl.balding_nicols_model(3, 100, n_variants, reference_genome="GRCh38")
mt = mt.key_rows_by(locus=hl.locus_from_global_position(mt.locus.global_position() * stride, reference_genome="GRCh38")

Thanks a lot!