Hi all,
I have two tables, gwas and loci_to_gene. loci_to_gene contains a field locus and a field gene, with locus being the key. gwas also contains locus, and I want to annotate it with the gene as well.
I am doing the following:
gwas = gwas.annotate(gene=loci_to_gene.index(gwas.locus).gene)
However, this marks one of the loci as having a missing gene even though that locus exists in loci_to_gene. I converted both tables to DataFrames and it works fine when I do everything using pandas.
I also tried None in loci_to_gene.index(gwas.locus).gene.collect() and it returns True, but None in loci_to_gene.collect() returns False.
Any ideas where I might be wrong? Hail version is 0.2.118-a4ca239602bb
EDIT: loci_to_gene can contain duplicates
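Since duplicates are possible, it may help to know whether the duplicated loci agree on the gene or map to different ones. Below is a minimal plain-Python sketch of that check. The (locus, gene) pairs are made up for illustration and stand in for rows of loci_to_gene; nothing here calls Hail.

```python
from collections import defaultdict

# Made-up (locus, gene) pairs standing in for rows of the lookup table.
pairs = [
    ("chr1:1000", "GENE_A"),
    ("chr1:2000", "GENE_B"),
    ("chr1:2000", "GENE_B"),  # exact duplicate row
    ("chr1:3000", "GENE_C"),
    ("chr1:3000", "GENE_D"),  # same locus, different gene
]

counts = defaultdict(int)          # how many rows each locus has
genes_by_locus = defaultdict(set)  # distinct genes seen per locus
for locus, gene in pairs:
    counts[locus] += 1
    genes_by_locus[locus].add(gene)

# Loci that appear more than once in the lookup table.
duplicated = sorted(locus for locus, n in counts.items() if n > 1)
# Loci whose duplicate rows disagree on the gene.
conflicting = sorted(
    locus for locus, genes in genes_by_locus.items() if len(genes) > 1
)

print("duplicated:", duplicated)
print("conflicting:", conflicting)
```

If the locus with the missing gene shows up in either list, that would be worth mentioning in a bug report.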
EDIT2: Something else I tried:
gwas_missing = gwas.filter(hl.is_missing(gwas.gene)) # this has only one row
loci_to_gene.filter(hl.literal(gwas_missing.locus.collect()).contains(loci_to_gene.locus)).show() # Prints a table with the correct gene for that locus
loci_to_gene[gwas_missing.locus].gene.collect() # This simply returns [None]
So I don’t quite get how the last two lines can return different results
EDIT3: If I unkey the gwas table with gwas.key_by(), then annotate the gene name and rekey it, it works correctly… Why does the key matter!!!
@danking Do you happen to know if this is a bug?
It’s hard to comment without an executable example. Can you share a CSV or Hail Table and some Python code that demonstrates the issue? Please also include the output of .describe() on both tables.
None in loci_to_gene.collect() will never return True because Table.collect returns a list of hl.Struct.
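To illustrate the point about collect(), here is a minimal plain-Python sketch. Row is a hypothetical stand-in for hl.Struct (it is not Hail's class), and the loci and gene names are made up. The membership test compares None against whole rows, so a missing field inside a row is never detected unless the field is extracted first.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    """Hypothetical stand-in for one collected row (an hl.Struct in Hail)."""
    locus: str
    gene: Optional[str]


# Made-up rows; the second one has a missing gene.
rows = [
    Row(locus="chr1:1000", gene="GENE_A"),
    Row(locus="chr1:2000", gene=None),
]

# Compares None against each whole row, so a missing field goes unnoticed.
none_in_rows = None in rows

# Extracting the field first compares None against the field values.
none_in_genes = None in [row.gene for row in rows]

print(none_in_rows, none_in_genes)
```

The second form corresponds to collecting the field itself rather than the whole table.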
Are any of the values for gwas.locus missing? What’s the key on gwas?
I can help you or fix bugs only with a simple reproducible example.
Apologies for the confusion, I ran None in loci_to_gene.gene.collect(), not None in loci_to_gene.collect().
I tried an older version of Hail (0.2.109) and could not reproduce the issue there; I can only reproduce it with 0.2.118. I am attaching two files for a minimal reproducible example (I had to remove a lot of columns and rows to hide some sensitive data, but the problem persists). Reproduce with the following:
import hail as hl

gwas = hl.read_table("gwas_filtered.ht")
loci_to_gene = hl.import_table("loci_to_gene.tsv", impute=True)
# Build a GRCh38 locus from the chromosome and locus (position) columns, then key by it
locus = hl.locus(loci_to_gene.chromosome, loci_to_gene.locus, "GRCh38")
loci_to_gene = loci_to_gene.annotate(locus=locus)
loci_to_gene = loci_to_gene.key_by("locus")
loci_to_gene = loci_to_gene.select("gene")
# On 0.2.118 this leaves gene missing for one locus
gwas = gwas.annotate(gene=loci_to_gene.index(gwas.locus).gene)
This has to be a bug if it only happens in some versions of Hail, right?
Hi @ag14774,
Thank you for the reproducing code. I’m able to run it, and I agree the result is confusing. I’ll continue to investigate and let you know.