Aha. Yeah, this won’t really work as written: union_cols requires copying both the left and the right data, so unioning iteratively as you’re trying is actually quadratic in the number of VCFs!
Another likely problem is that union_cols filters to common sites – if your VCFs are single-sample VCFs that record only the sites where that sample has a variant, then you’ll end up with no sites at the end (every site will be hom-ref, and so missing, in somebody’s VCF). Is this the right characterization of your data?
You could get down to N log2(N) complexity by unioning in a tree structure, but this will also be pretty slow. The right solution is a gVCF importer, which one of our engineers, Chris, is working on. This won’t be ready for several months, though.
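To make the tree-structured union concrete, here is a rough sketch in plain Python. `tree_union` is a hypothetical helper (not part of Hail) that merges items pairwise in rounds, so each input passes through about log2(N) merges instead of up to N. It takes the merge step as a function, so it is not tied to any particular library.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def tree_union(items: List[T], combine: Callable[[T, T], T]) -> T:
    """Merge items pairwise, round by round, until one remains.

    Each round halves the number of items, so every input is copied
    roughly log2(N) times rather than up to N times.
    """
    if not items:
        raise ValueError("need at least one item to union")
    level = list(items)
    while len(level) > 1:
        next_level = []
        # Merge neighbours: (0, 1), (2, 3), ...
        for i in range(0, len(level) - 1, 2):
            next_level.append(combine(level[i], level[i + 1]))
        # An odd item out is carried into the next round unchanged.
        if len(level) % 2 == 1:
            next_level.append(level[-1])
        level = next_level
    return level[0]


# Assumed usage with Hail (untested here; `paths` is a hypothetical list of VCF paths):
#   import hail as hl
#   mts = [hl.import_vcf(p) for p in paths]
#   merged = tree_union(mts, lambda left, right: left.union_cols(right))
```

Note that this only changes the amount of copying. It does nothing about the common-sites filtering, so it doesn't help if your VCFs share no sites.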
It may also be possible to use to_matrix_table, which will be linear, but will require shuffling all the data over the network, which could be very slow.