Subset large VCF into multiple VCFs

Hey Sam, you probably need to allocate more memory to the Java process. How many variants are in the VCFs?
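If the Java process here is the JVM that Hail starts through PySpark in local mode, one common way to give it more memory is the `PYSPARK_SUBMIT_ARGS` environment variable. A minimal sketch under that assumption: the `8g` value is only a placeholder, and `spark_submit_args` is a hypothetical helper, not part of Hail.

```python
import os


def spark_submit_args(driver_memory: str = "8g") -> str:
    """Build a PYSPARK_SUBMIT_ARGS value that sets the driver JVM memory."""
    return f"--driver-memory {driver_memory} pyspark-shell"


# This has to be set before the JVM starts, i.e. before hl.init()
# or the first Hail call in the session.
os.environ["PYSPARK_SUBMIT_ARGS"] = spark_submit_args("8g")
```

On a cluster, memory is normally set through the cluster's own Spark configuration instead, so this only applies to a local run.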

I do not remember how we arrived at the on-prem solution. Getting your GCP Service Account keys into a Docker container securely is definitely a little complicated. Neither a VM nor a Dataproc cluster would have this issue.

In general, the problem you’re trying to solve is unfortunately not well suited to either Hail MatrixTable or VCF. Both formats require you to read a whole row (variant) in order to read out any subset of samples from that row. @konradjk has some experience using Hail to perform this operation. Our medium-term plans include engineering a format that supports this use-case better, but there’s nothing usable yet.
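For reference, a rough sketch of what the straightforward approach looks like in Hail today, assuming you want to split the samples into fixed-size batches and write one VCF per batch. The paths, the batch size, and the helper names `batches` and `export_subsets` are all made up for illustration, and this is not necessarily how @konradjk does it.

```python
from typing import Iterator, List


def batches(samples: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` sample IDs."""
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


def export_subsets(vcf_path: str, out_prefix: str, size: int) -> None:
    """Write one block-gzipped VCF per batch of samples."""
    import hail as hl  # imported here so `batches` is usable without Hail

    mt = hl.import_vcf(vcf_path)
    samples = mt.s.collect()  # column key: the sample IDs
    for i, batch in enumerate(batches(samples, size)):
        keep = hl.literal(set(batch))
        subset = mt.filter_cols(keep.contains(mt.s))
        # Every export re-reads each full row of the input, which is
        # exactly the cost described above.
        hl.export_vcf(subset, f"{out_prefix}.{i}.vcf.bgz")
```

Each batch makes a full pass over the input, so the total work grows with the number of output files. Writing the data once as a native MatrixTable and reading that back for the per-batch passes should at least avoid re-parsing the VCF text every time.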