Table.index returning None

Hi all,

I have two tables, gwas and loci_to_gene. loci_to_gene contains a field locus and a field gene, with locus being the key. gwas also contains locus, and I want to annotate it with the gene as well.

I am doing the following:

gwas = gwas.annotate(gene=loci_to_gene.index(gwas.locus).gene)

However, this leaves one of the loci with a missing gene even though that locus exists in loci_to_gene. I converted both tables to DataFrames, and the same join works fine when I do everything in pandas.

I also tried None in loci_to_gene.index(gwas.locus).gene.collect(), which returns True, but None in loci_to_gene.collect() returns False.

Any ideas where I might be wrong? Hail version is 0.2.118-a4ca239602bb

EDIT: loci_to_gene can contain duplicates
EDIT2: Something else I tried:

gwas_missing = gwas.filter(hl.is_missing(gwas.gene)) # this has only one row
loci_to_gene.filter(hl.literal(gwas_missing.locus.collect()).contains(loci_to_gene.locus)).show() # Prints a table with the correct gene for that locus
loci_to_gene[gwas_missing.locus].gene.collect() # This simply returns [None]

So I don’t quite get how the last two lines can return different results

EDIT3:
If I unkey the gwas table with gwas.key_by(), annotate the gene name, and then rekey, it works correctly. Why does the key matter?
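For reference, the unkey, annotate, rekey workaround from EDIT3 could be sketched as a small helper. The function name annotate_gene_unkeyed is hypothetical. It relies only on Table.key, Table.key_by, Table.annotate and Table.index, and it assumes gwas has a locus field and loci_to_gene is keyed by locus. This is a sketch of the workaround, not a fix for the underlying behaviour.

```python
def annotate_gene_unkeyed(gwas, loci_to_gene):
    """Hypothetical helper: annotate `gene` on an unkeyed copy of gwas, then rekey.

    Assumes `gwas` has a `locus` field and `loci_to_gene` is keyed by `locus`.
    """
    # Remember the original key field names so they can be restored afterwards.
    key_fields = list(gwas.key)
    # key_by() with no arguments removes the key from the table.
    unkeyed = gwas.key_by()
    # Look up each locus in loci_to_gene and pull out its gene.
    unkeyed = unkeyed.annotate(gene=loci_to_gene.index(unkeyed.locus).gene)
    # Restore the original key.
    return unkeyed.key_by(*key_fields)
```

Usage would be gwas = annotate_gene_unkeyed(gwas, loci_to_gene). Rekeying may require a shuffle, so this could be slow on large tables.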

@danking Do you happen to know if this is a bug?

It’s hard to comment without an executable example. Can you share a CSV or Hail Table and some Python code that demonstrates the issue? Please also include the output of .describe() on both tables.


None in loci_to_gene.collect() will never return True because Table.collect returns a list of hl.Struct.
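To illustrate with plain Python stand-ins (this does not call Hail; the dicts play the role of the hl.Struct rows that Table.collect returns, and the loci and gene names are made up):

```python
# Stand-in for Table.collect(): one row object per table row.
rows = [
    {"locus": "chr1:100", "gene": "GENE_A"},
    {"locus": "chr1:200", "gene": None},  # a row whose gene is missing
]

# Stand-in for table.gene.collect(): a flat list of field values.
genes = [row["gene"] for row in rows]

# Every element of `rows` is a row object, so None is never a member,
# even though one row has a missing gene.
none_in_rows = None in rows

# The list of field values does contain the missing value itself.
none_in_genes = None in genes
```

Here none_in_rows is False while none_in_genes is True, so only the check on the collected field can detect a missing gene.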

Are any of the values for gwas.locus missing? What’s the key on gwas?

I can only help or fix bugs if I have a simple reproducible example.

Apologies for the confusion: I ran None in loci_to_gene.gene.collect(), not None in loci_to_gene.collect().

I tried an older version of Hail (0.2.109) and could not reproduce the issue there. I can only reproduce it with 0.2.118. I am attaching two files for a minimal reproducible example (I had to remove a lot of columns and rows to hide some sensitive data, but the problem persists). Reproduce with the following:

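import hail as hl

# Read the GWAS table and import the locus-to-gene mapping
gwas = hl.read_table("gwas_filtered.ht")
loci_to_gene = hl.import_table("loci_to_gene.tsv", impute=True)
# Build a GRCh38 locus from the chromosome column and the position in the locus column
locus = hl.locus(loci_to_gene.chromosome, loci_to_gene.locus, "GRCh38")
loci_to_gene = loci_to_gene.annotate(locus=locus)
loci_to_gene = loci_to_gene.key_by("locus")
loci_to_gene = loci_to_gene.select("gene")
# Join: on 0.2.118 this leaves one locus with a missing gene
gwas = gwas.annotate(gene=loci_to_gene.index(gwas.locus).gene)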

This has to be a bug if it only happens in some versions of Hail, right?

Hi @ag14774,
Thank you for the reproducing code. I’m able to run it, and I agree the result is confusing. I’ll continue to investigate and let you know.

Tracking issue: error in join · Issue #13339 · hail-is/hail