Unfortunately, this stage never seems to finish. For lack of a better way to explain it, it just prints four lines of "Hail: INFO: Ordering unsorted dataset with network shuffle" with no progress bar, and it has been running for 2 days. I am writing the Hail table out first because I plan to do more downstream filtering and annotation, which ran into other issues, so I thought it would be better to make a checkpoint first. I’d appreciate any input on making this process more efficient!
MatrixTable.entries is memory hungry and slow. Avoid it unless you’ve substantially reduced the amount of data you have.
IIUC, you want to filter individual trios at each variant, not remove entire variants. What is your ultimate goal? Will you perform some sort of analysis? Do you want a list of trios per variant?
Without knowing more about your end goal, my best recommendation is to use filter_entries:
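A minimal sketch of what a filter_entries approach could look like, assuming the data is first reshaped with hl.trio_matrix and that "inherited" means the proband and at least one parent carry a non-reference genotype. The function name and the GT entry field are assumptions for illustration, not something stated in this thread.

```python
def keep_inherited_entries(mt, pedigree):
    """Keep only (variant, trio) entries where the proband and at least one
    parent carry a non-reference call. Assumes `mt` has a `GT` entry field."""
    # Imported lazily so that defining this function does not require Hail.
    import hail as hl

    # One column per complete trio; each entry holds proband/father/mother entries.
    tm = hl.trio_matrix(mt, pedigree, complete_trios=True)

    proband_carrier = tm.proband_entry.GT.is_non_ref()
    parent_carrier = (tm.father_entry.GT.is_non_ref()
                      | tm.mother_entry.GT.is_non_ref())

    # filter_entries drops individual entries, not whole rows or columns,
    # so no call to MatrixTable.entries() is needed.
    return tm.filter_entries(proband_carrier & parent_carrier)
```

Usage would be along the lines of keep_inherited_entries(mt, hl.Pedigree.read('trios.fam')), where 'trios.fam' is a hypothetical path to a PLINK .fam file.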
Thanks @danking. My goal is to export a list of annotated inherited variants for a subset of genes and trios. With your method above, do you have any advice on how to annotate each row with the trios that carry an inherited variant? It would be very helpful to have that information when I export the rows of the mt.
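One possible way to get a per-row list of trios, sketched under assumptions: the input is a trio matrix (as produced by hl.trio_matrix) whose non-inherited entries have already been removed with filter_entries, and the original column key was named s. The function and the inherited_in field name are hypothetical.

```python
def rows_with_carrier_trios(tm):
    """Return a rows Table with an `inherited_in` array of proband IDs.
    Assumes `tm` is a trio matrix with unwanted entries already filtered."""
    # Imported lazily so that defining this function does not require Hail.
    import hail as hl

    # Row-wise aggregation skips filtered entries, so only the trios that
    # still have an entry at this variant are collected.
    tm = tm.annotate_rows(inherited_in=hl.agg.collect(tm.proband.s))

    # Drop variants with no remaining trio, then keep only the row fields.
    tm = tm.filter_rows(hl.len(tm.inherited_in) > 0)
    return tm.rows()
```

The resulting table could then be written with export, for example rows_with_carrier_trios(tm).export('inherited.tsv'), where the output path is a placeholder.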
But I seem to be running into the same issue: four lines of "Hail: INFO: Ordering unsorted dataset with network shuffle" and no progress bar. Do you know if it is possible to troubleshoot this?
This doesn’t really work for me for some reason (memory problem?). Can I get some advice on this? Thank you very much!! If it helps, I am working on a high-performance cluster, and I initialized Hail with hl.init(log='log.log', spark_conf={'spark.driver.memory': '20g', 'spark.executor.memory': '20g'})
Thank you for your comments. I believe N was 4 in my scenario as well. I am launching a Jupyter Notebook session to use Hail. Do you know how adding the environment variable would work there? I am not sure whether adding it to .bashrc will work.
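For a notebook, one option is to set the variable from Python before Hail starts Spark. The sketch below assumes the variable in question is PYSPARK_SUBMIT_ARGS (the one PySpark reads when it launches the JVM) and reuses the 20g values from the hl.init call above; the helper name is made up for illustration. A .bashrc entry should only have an effect if the Jupyter server process was started from a shell that sourced it.

```python
import os


def pyspark_submit_args(driver_memory, executor_memory=None):
    """Build a PYSPARK_SUBMIT_ARGS value; it must end with 'pyspark-shell'."""
    parts = [f"--driver-memory {driver_memory}"]
    if executor_memory is not None:
        parts.append(f"--executor-memory {executor_memory}")
    parts.append("pyspark-shell")
    return " ".join(parts)


# Run this in the first notebook cell, before Hail is initialized: the JVM
# reads the value at launch, so changing it afterwards has no effect.
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args("20g", "20g")

# Then, in a later cell:
#   import hail as hl
#   hl.init(log='log.log')
```

If Hail has already been initialized in the session, the kernel would need to be restarted for the new value to be picked up.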