We produced big matrix tables in GCP using a cloudtools-generated Hail cluster.
By “big” I mean 200 million+ variants x 20k samples, though each chromosome is smaller, of course. Importing the VCFs into per-chromosome MatrixTables was straightforward, but getting the data down for local Hail analysis has been anything but. Specifically, gsutil rsync tends to hang, and gsutil cp is also flaky.
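One workaround we are experimenting with is to pull the data one chromosome at a time and lean on the fact that gsutil rsync is resumable, so a failed invocation can simply be re-run. The bucket and destination paths below are hypothetical placeholders, and the retry wrapper is our own sketch, not anything gsutil provides:

```shell
#!/usr/bin/env bash
set -u

# Hypothetical paths -- substitute your own bucket and scratch area.
SRC="gs://our-bucket/matrixtables"
DST="/local/scratch/matrixtables"

# gsutil rsync is resumable: re-running it after a hang or timeout only
# transfers objects that are still missing, so a dumb retry loop suffices.
retry() {
  local attempt=0 max=10
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    echo "retry $attempt/$max after failure" >&2
    sleep "${RETRY_SLEEP:-30}"
  done
}

# Fetch one chromosome at a time rather than the whole dataset at once;
# -m parallelises within each rsync. Skipped when gsutil isn't on PATH.
if command -v gsutil >/dev/null 2>&1; then
  for chr in {1..22} X Y; do
    retry gsutil -m rsync -r "$SRC/chr${chr}.mt" "$DST/chr${chr}.mt"
  done
fi
```

Running per-chromosome also means a single bad transfer costs you one chromosome's worth of re-checking, not the whole 30+ TiB.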
See a colleague’s note below:
"Have others had any problems using the Google Cloud SDK to retrieve
data, either with the version that is installed on farm4 or a locally
managed copy (running on their laptops, farm3, farm4 – a head node or
within an LSF job – an OpenStack instance, the Babbage Difference Engine, etc.)?
I am trying to download some very large files from a GS bucket and am
getting wildly sketchy results. It has worked, once, but I’ve been
unable to get it to go since. At first I tried gsutil cp, which was
successful in exactly one invocation. After that decided to flake out, I
tried gsutil rsync: that works on my laptop, but I have to physically
keep it awake (plus I don’t have 30+TiB to spare!). It claims to be able
to do GS-to-S3 directly, but I can’t get that to work at all (I feel
that’s more to do with our S3 than Google).
As for documenting the behaviour to help debugging, there is very little
I can say. I run the command – either in a tmux session or as part of
an LSF job – and it just sits there. It logs nothing, writes nothing
to disk and uses very little CPU; it might be using a lot of
network, but I can’t really tell. After a few hours, it finally bails
out with a massive Python stack trace, ending in an
httplib.ResponseNotReady exception, suggesting that some timeout is
occurring somewhere in the formation of the HTTP response from Google.
The conspiracy theorist in me thinks that this is a classic case of
vendor lock-in. Has anyone had better success with really large files?"
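For the httplib.ResponseNotReady timeouts my colleague describes, the knobs worth trying live in the boto config file (~/.boto) that gsutil reads. The option names below are real gsutil/boto settings, but the values are guesses to experiment with, not recommendations:

```ini
# ~/.boto -- tuning sketch for flaky long downloads.
[Boto]
# Give slow responses longer before the socket times out, and retry more.
http_socket_timeout = 300
num_retries = 10

[GSUtil]
# Download large objects in parallel slices rather than one long stream,
# so a stalled connection doesn't kill the whole transfer.
sliced_object_download_threshold = 150M
sliced_object_download_max_components = 8
```

Combined with gsutil -m and rsync's resumability, this at least turns "hangs for hours then stack-traces" into something that can make forward progress unattended.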