This is a feature that will make life much easier for many of you struggling to wrangle data between Python and Google buckets or Amazon S3. It comes in the form of three utility methods in the hail module: `hadoop_read`, `hadoop_write`, and `hadoop_copy`.
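Since they live at the top level of the hail module, importing them should look something like this (a quick sketch; check your version's docs if the import path differs):

```python
# Assumes the functions are exposed at the top level of the hail package.
from hail import hadoop_read, hadoop_write, hadoop_copy
```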
These methods can be used to read from, write to, and copy data on and off any file system Hail can see from its Spark-y methods. `hadoop_write` and `hadoop_read` open file handles that extend the Python standard IO interface, but pass data through the JVM and the Hadoop filesystem API under the hood. Here are some examples from the docs:
```python
with hadoop_read('gs://my-bucket/notes.txt') as f:
    for line in f:
        print(line.strip())
```

```python
with hadoop_write('gs://my-bucket/notes.txt') as f:
    f.write('result1: %s\n' % result1)
    f.write('result2: %s\n' % result2)
```

```python
hadoop_copy('gs://hail-common/LCR.interval_list',
            'file:///mnt/data/LCR.interval_list')
```
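One nice pattern is chaining these together: read a small file from a bucket, write a summary back, and pull a copy down to local disk. Here's a sketch of that workflow (the bucket paths and file names are made up for illustration):

```python
# Hypothetical example: count samples listed in a bucket file,
# write the count back to the bucket, and copy it to local disk.
samples = []
with hadoop_read('gs://my-bucket/samples.txt') as f:      # illustrative path
    for line in f:
        samples.append(line.strip())

with hadoop_write('gs://my-bucket/sample-count.txt') as f:  # illustrative path
    f.write('n_samples: %d\n' % len(samples))

# Pull the summary down to the local file system for a quick look.
hadoop_copy('gs://my-bucket/sample-count.txt',
            'file:///tmp/sample-count.txt')
```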