This is a feature that will make life much easier for many of you struggling to wrangle data between Python and Google buckets or Amazon S3. It comes in the form of three utility methods in the hail module: `hadoop_read`, `hadoop_write`, and `hadoop_copy`.
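Since they live at the top level of the hail module, importing them should look something like this (a quick sketch; check your version's docs if the import path differs):

```python
# Assumes the functions are exposed at the top level of the hail package.
from hail import hadoop_read, hadoop_write, hadoop_copy
```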
These methods can be used to read from, write to, and copy data on and off any file system Hail can see from its Spark-y methods. `hadoop_write` and `hadoop_read` open file handles that extend the Python standard IO interface, but pass data through the JVM and the Hadoop filesystem API under the hood. Here are some examples from the docs:
```python
with hadoop_read('gs://my-bucket/notes.txt') as f:
    for line in f:
        print(line.strip())
```

```python
with hadoop_write('gs://my-bucket/notes.txt') as f:
    f.write('result1: %s\n' % result1)
    f.write('result2: %s\n' % result2)
```

```python
hadoop_copy('gs://hail-common/LCR.interval_list',
            'file:///mnt/data/LCR.interval_list')
```
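One nice pattern is chaining these together: read a small file from a bucket, write a summary back, and pull a copy down to local disk. Here's a sketch of that workflow (the bucket paths and file names are made up for illustration):

```python
# Hypothetical example: count samples listed in a bucket file,
# write the count back to the bucket, and copy it to local disk.
samples = []
with hadoop_read('gs://my-bucket/samples.txt') as f:      # illustrative path
    for line in f:
        samples.append(line.strip())

with hadoop_write('gs://my-bucket/sample-count.txt') as f:  # illustrative path
    f.write('n_samples: %d\n' % len(samples))

# Pull the summary down to the local file system for a quick look.
hadoop_copy('gs://my-bucket/sample-count.txt',
            'file:///tmp/sample-count.txt')
```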