I’m importing and annotating a vcf with just one chromosome, and I get the following message when annotating (with annotate_variants_table):
Hail: INFO: Ordering unsorted dataset with network shuffle
The vcf file is ordered by position, and so is the annotation table. Does this mean that Hail performs an ordering somewhere when annotating?
Hail does ordered joins almost everywhere, which means that both sides need to be ordered. This message is probably referring to the shuffle that orders the annotation table.
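To illustrate why an ordered join needs both sides sorted, here is a minimal pure-Python sketch of a merge join. It is not Hail's implementation; the names merge_join, variants and annotations are made up for the example, and keys are assumed to be unique on the right-hand side.

```python
def merge_join(left, right):
    """Left-join two lists of (key, value) pairs that are both sorted by key.

    Each list is walked exactly once, which is only correct when both
    inputs are already sorted by key. Assumes keys are unique in `right`.
    """
    out = []
    j = 0
    for key, value in left:
        # Advance the right cursor past keys smaller than the current left key.
        while j < len(right) and right[j][0] < key:
            j += 1
        # Take the right value if the keys match, otherwise None.
        match = right[j][1] if j < len(right) and right[j][0] == key else None
        out.append((key, value, match))
    return out


# Hypothetical data keyed by (contig, position), both sides already sorted.
variants = [(("1", 100), "A/T"), (("1", 200), "G/C"), (("1", 300), "C/A")]
annotations = [(("1", 100), "geneX"), (("1", 300), "geneY")]
joined = merge_join(variants, annotations)
```

If either input were unsorted, the single pass would skip matches, which is why an unsorted side has to be sorted (shuffled) before the join.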
Thank you so much for the very fast reply! But then, does this mean that even if the annotation table is ordered beforehand, a shuffle will be triggered anyway? Is there a way to avoid these shuffles on already sorted data?
No, if it’s ordered then there shouldn’t be a shuffle in annotate_variants_table.
I do recall that filter_variants_table may always do a shuffle, though.
Ok thanks! Last question though (maybe silly, but just in case), the right ordering is by chromosome and then position, right? Or is there any other factor that must be considered?
I’m asking also because depending on the vcf that I import, I get either this message:
INFO: Coerced sorted dataset
Or this one:
INFO: Coerced almost-sorted dataset
Even though I have previously sorted both VCFs with bedtools sort.
Ordering is actually by the full variant (TVariant) in 0.1. This means if you have a multiallelic variant split over two lines, it could be out of order. In this case, it won’t trigger a full shuffle but a local sort (Coerced almost-sorted dataset).
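As a rough way to see the difference, the sketch below checks whether a list of rows is sorted by position only versus by the full variant. It is plain Python, not Hail code: the is_sorted helper and the example rows are hypothetical, and the key (contig, position, ref, alt) is an assumption about how the full-variant ordering breaks ties. Contigs are compared as plain strings here, which is good enough for a single-chromosome file but is not how contigs are ordered in general.

```python
def is_sorted(rows, key):
    """Return True if rows are in non-decreasing order under the given key."""
    keys = [key(r) for r in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


# Hypothetical rows: (contig, position, ref, alt).
rows = [
    ("1", 1000, "A", "T"),
    ("1", 2000, "G", "T"),  # split multiallelic site: alt "T" listed before "C"
    ("1", 2000, "G", "C"),
    ("1", 3000, "C", "A"),
]

# Sorted as far as a position-based sort is concerned...
by_position = is_sorted(rows, key=lambda r: (r[0], r[1]))
# ...but not when ref and alt alleles are part of the key.
by_full_variant = is_sorted(rows, key=lambda r: r)
```

Here by_position is True and by_full_variant is False, so a file like this can look sorted after a position-based sort and still need a local re-sort.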