Export_plink with generic struct locus

Hi!
Having read in the documentation for export_plink that the locus row-key can be of type tstruct (with two fields: contig with type tstr and position with type tint32) rather that tlocus, I was hoping that the following would work:

mt = hl.balding_nichols_model(
    1, 10, 20, reference_genome='GRCh37'
)
mt = mt.annotate_rows(
    new_locus = hl.struct(
        contig = mt.locus.contig,
        position = mt.locus.position
    )
).key_rows_by(
    "new_locus", "alleles"
).drop(
    'locus'
).rename({
    "new_locus": "locus"
})
hl.export_plink(
    mt,
    "deleteme"
)

But it gives me

ValueError: Method 'export_plink' requires row key to be two fields 'locus' (type 'locus<any>') and 'alleles' (type 'array<str>')
  Found:
    'locus': struct{contig: str, position: int32}
    'alleles': array<str>

I’m actually trying to export an mt to the plink format after having done some funky stuff to contig names, which makes my loci break convention of both GRCh38 and GRCh37. Is that possible?

Hmm. Yeah, the seems like an overly restrictive check. I’ll have it remove, but for now you could modify the impex.py file in Hail to remove this line from export_plink.

    require_biallelic(dataset, 'export_plink')

EDIT: fixed the line that needs to be removed.

The implementation does rely on the type being a locus rather than a struct (you’ll get a much worse assertion error or cast exception if you remove this line), but it shouldn’t be hard to fix.

Fix here:

Thanks so much for such quick response!

But, to be honest, for the moment I think I’ll need to lean into a roundabout solution at the level of the interface, rather than patching my installation. The funky business that I was talking about is only stripping “chr” off of the contigs from GRCh38, and I can assume that there’s only chr1-22XYM. Perhaps You’ll agree that the following would also be reasonable:

contig_recoding = {'chr' + str(i):  str(i) for i in range(1, 23)}
contig_recoding.update({'chrX': 'X', 'chrY': 'Y', 'chrM': 'MT'})

try:
    my_ref = hl.get_reference("my_ref")
except KeyError:
    grch38 = hl.get_reference('GRCh38')
    my_contigs = list(contig_recoding.values())
    my_lengths = {
        contig_recoding[c]: grch38.lengths[c] \
        for c in contig_recoding
    }
    my_x_contigs = [contig_recoding[c] for c in grch38.x_contigs]
    my_y_contigs = [contig_recoding[c] for c in grch38.y_contigs]
    my_mt_contigs = [contig_recoding[c] for c in grch38.mt_contigs]
    my_par = []
    my_ref = hl.ReferenceGenome(
        name="my_ref",
        contigs=my_contigs,
        lengths=my_lengths,
        x_contigs=my_x_contigs,
        y_contigs=my_y_contigs,
        mt_contigs=my_mt_contigs,
        par=my_par
    )

    
mt = hl.balding_nichols_model(
    1, 10, 20, reference_genome='GRCh38'
)
mt = mt.annotate_cols(
    sample_name = "sample_" + hl.str(mt.sample_idx)
).key_cols_by(
    "sample_name"
)  # col-keys need to be string

mt = mt.annotate_rows(
    new_locus = hl.locus(
        contig = hl.literal(contig_recoding)[mt.locus.contig],
        pos = mt.locus.position,
        reference_genome=my_ref
    )
).key_rows_by(
    "new_locus", "alleles"
).drop(
    'locus'
).rename({
    "new_locus": "locus"
})
hl.export_plink(
    mt,
    "deleteme"
)

Now that I think about, it actually makes sense to shame anyone who wants to mess with loci into admitting that she is defining and working with her own new reference genome :slight_smile:

I think it would also be quite reasonable to expose a way to define a new reference genome by renaming names from an existing one.