I want to export the genotypes of an aggregated file using this code:
data=data.key_rows_by(variant=hl.variant_str(data_filtered.locus,data_filtered.alleles))
data.GT.export("fileName.tsv")
This is really slow: about 15 minutes for a very big file with thousands of samples. What can I do to improve the performance? I need the locus, the alleles, and the genotypes in a file.
We’ll need the Hail log and the full script you ran to fully diagnose this.
Exporting to an uncompressed TSV is generally slow. You might try exporting as filename.tsv.bgz. Also, exporting a single file requires a slow concatenation step. You might try parallel=True instead if you can deal with many separate files of genotypes.
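If you do end up with many separate files, reading them back as one stream is not much work. Here is a minimal sketch using only the Python standard library. It assumes the shards sit in one directory, are named so that they sort in order (the `part-*` pattern is a guess, so check what your export actually writes), and each begin with the same header line. The function name `iter_shard_lines` is made up for this example. Block-gzipped (.bgz) shards are a series of gzip members, so the standard gzip module can read them.

```python
import glob
import gzip
import os


def iter_shard_lines(shard_dir, pattern="part-*", has_header=True):
    """Yield lines from every shard in shard_dir as if they were one file.

    Assumptions (not taken from the Hail docs): shards sort correctly by
    file name, and when has_header is True every shard starts with the
    same header line.
    """
    paths = sorted(glob.glob(os.path.join(shard_dir, pattern)))
    for shard_index, path in enumerate(paths):
        # .bgz (BGZF) files are concatenated gzip members, so gzip.open works.
        opener = gzip.open if path.endswith((".gz", ".bgz")) else open
        with opener(path, "rt") as handle:
            for line_index, line in enumerate(handle):
                if has_header and line_index == 0 and shard_index > 0:
                    continue  # skip the repeated header in later shards
                yield line.rstrip("\n")
```

You would then loop over `iter_shard_lines("fileName.tsv.bgz")` (or whatever directory the export produced) and split each line on tabs, without ever building the single concatenated file.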
Thank you,
I can’t find a parallel option in the export of a field.
Heh. You’re right. I’ll ask someone to fix this. Are you able to use VCF files (export_vcf) instead?