Hi @jbs!

I was advised that `filter_rows()` is more efficient than `filter_intervals()` as a way of reducing cost.
That’s not right. `filter_intervals` will avoid reading any partitions that don’t overlap any of the intervals, while `filter_rows` has to scan through every row and apply the filter condition (though our optimizer tries very hard to rewrite `filter_rows` to `filter_intervals` whenever possible). Also, that snippet only filters the variant data, not the reference data, which will likely cause downstream performance issues.
I saw here that Hail doesn’t support filtering a subset of the full 245k AoU sample set very well in the VDS. In terms of the optimal order of operations here, do you recommend filtering by samples first, writing a smaller VDS, then filtering by intervals before densifying?
You want to filter the variants first. The reason filtering samples is expensive is that a VDS (like a MatrixTable) is stored “variant major”: for each variant it stores an array of the entries for all the samples. Filtering samples means rewriting every one of those entries arrays, whereas filtering variants just means skipping over the filtered rows. By filtering variants first, only the entries arrays for the kept variants need to be rewritten.
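The variant-major layout can be sketched the same way. Again, this is plain Python, not the real on-disk format: the dicts, sample names, and genotype strings are made up. It shows why dropping samples touches every row's entries array, while dropping variants leaves the surviving rows untouched.

```python
# Toy variant-major layout: one entries list per variant, one slot per sample.
samples = ["s1", "s2", "s3", "s4"]
rows = [
    {"variant": "chr1:100", "entries": ["0/0", "0/1", "0/0", "1/1"]},
    {"variant": "chr1:200", "entries": ["0/1", "0/0", "0/0", "0/0"]},
    {"variant": "chr1:300", "entries": ["0/0", "0/0", "1/1", "0/1"]},
]

def filter_samples(rows, samples, keep):
    """Keeping a sample subset rewrites the entries array of EVERY row."""
    idx = [i for i, s in enumerate(samples) if s in keep]
    return [
        {"variant": r["variant"], "entries": [r["entries"][i] for i in idx]}
        for r in rows
    ]

def filter_variants(rows, keep):
    """Keeping a variant subset just drops whole rows; kept rows are untouched."""
    return [r for r in rows if r["variant"] in keep]

# Variants first: only one row survives, so only one entries array is rewritten.
small = filter_variants(rows, {"chr1:200"})
subset = filter_samples(small, samples, {"s1", "s4"})
```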
So in short, I think you were doing everything right before. One thought: can you run `vds.reference_data.describe()` and check whether there’s a global field `ref_block_max_length`? That field is important for making `filter_intervals` on the reference data efficient.
@chrisvittal Can you think of anything else to suggest? Is the cost for the first run in the ballpark of what you would expect?
Actually, you probably want to write out the VDS after filtering, before densifying. That may actually be the cause of the high cost. Does that sound right, Chris?