Read from and write to hdfs/gs/s3 from Python

#1

This feature should make life much easier for anyone struggling to move data between Python and Google Cloud Storage buckets or Amazon S3. It comes in the form of three utility methods in the hail module: hadoop_read, hadoop_write, and hadoop_copy.

These methods read from, write to, and copy data between any file systems that Hail can access through its Spark-based methods. hadoop_read and hadoop_write open file handles that extend the Python standard IO interface, but pass data through the JVM and the Hadoop filesystem API under the hood. Here are some examples from the docs:

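# Read a text file from a bucket, line by line
with hadoop_read('gs://my-bucket/notes.txt') as f:
    for line in f:
        print(line.strip())

# Write text to a file in a bucket (result1 and result2 are defined elsewhere)
with hadoop_write('gs://my-bucket/notes.txt') as f:
    f.write('result1: %s\n' % result1)
    f.write('result2: %s\n' % result2)

# Copy a file from a bucket to the local file system
hadoop_copy('gs://hail-common/LCR.interval_list',
            'file:///mnt/data/LCR.interval_list')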
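
Since the handles follow the standard IO interface, code written against ordinary file-like objects should work with them unchanged. Here is a minimal sketch of that idea; parse_results is a hypothetical helper (not part of Hail), and io.StringIO stands in for the handle that hadoop_read would return, so the snippet runs without a cluster:

```python
import io


def parse_results(handle):
    """Parse 'key: value' lines from any text file-like object into a dict.

    Hypothetical helper for illustration; it only iterates over lines,
    so any handle that behaves like a text stream should work.
    """
    results = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition(':')
        results[key.strip()] = value.strip()
    return results


# Assumption: io.StringIO is a stand-in for the handle returned by hadoop_read.
fake_handle = io.StringIO('result1: 0.93\nresult2: 17\n')
parsed = parse_results(fake_handle)
print(parsed)
```

With Hail available, the same function could presumably be called inside a `with hadoop_read(...) as f:` block by passing `f` in place of the stand-in.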

#2

In Hail 0.2, hadoop_read and hadoop_write have been consolidated into a single method, hadoop_open, which takes the mode as its second argument:

# Read: mode 'r'
with hadoop_open('gs://my-bucket/notes.txt', 'r') as f:
    for line in f:
        print(line.strip())

# Write: mode 'w' (result1 and result2 are defined elsewhere)
with hadoop_open('gs://my-bucket/notes.txt', 'w') as f:
    f.write('result1: %s\n' % result1)
    f.write('result2: %s\n' % result2)
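
Because hadoop_open takes a path and a mode much like the built-in open, one way to keep scripts portable is to pass the opener in as a parameter. The sketch below is only an illustration under that assumption: write_then_read and its opener argument are hypothetical names, and the demo uses the built-in open on a temporary local file so it runs without Hail:

```python
import os
import tempfile


def write_then_read(path, lines, opener=open):
    """Write lines to path, then read them back, using a pluggable opener.

    Assumption: opener(path, mode) returns a context-managed text handle,
    as the built-in open does and as hadoop_open appears to in the
    examples above.
    """
    with opener(path, 'w') as f:
        for line in lines:
            f.write(line + '\n')
    with opener(path, 'r') as f:
        return [line.strip() for line in f]


# Local demo with the built-in open; the path is a throwaway temp file.
tmp_dir = tempfile.mkdtemp()
local_path = os.path.join(tmp_dir, 'notes.txt')
round_trip = write_then_read(local_path, ['result1: 1', 'result2: 2'])
print(round_trip)
```

On a cluster, passing opener=hadoop_open together with a gs:// or s3 path would be the intended swap, though that is untested here.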