Read from and write to hdfs/gs/s3 from Python

This is a feature that will make life much easier for many of you struggling to wrangle data between Python and Google buckets or Amazon S3. It comes in the form of three utility methods in the hail module: hadoop_read, hadoop_write, and hadoop_copy.

These methods can be used to read from, write to, and copy data on or off any file system Hail can see through its Spark-based methods: HDFS, Google buckets, Amazon S3, or the local file system. hadoop_read and hadoop_write open file handles that extend the Python standard IO interface, but pass data through the JVM and the Hadoop filesystem API under the hood. Here are some examples from the docs:

from hail import hadoop_read, hadoop_write, hadoop_copy

# Read a text file from a Google bucket line by line.
with hadoop_read('gs://my-bucket/notes.txt') as f:
    for line in f:
        print(line.strip())

# Write results back to the bucket (result1 and result2 defined elsewhere).
with hadoop_write('gs://my-bucket/notes.txt') as f:
    f.write('result1: %s\n' % result1)
    f.write('result2: %s\n' % result2)

# Copy a file from a bucket to the local file system.
hadoop_copy('gs://hail-common/LCR.interval_list',
            'file:///mnt/data/LCR.interval_list')
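
Because the handle iterates over lines like an ordinary Python file object, you can parse structured text from a bucket without any extra plumbing. Here is a minimal sketch, assuming a hypothetical tab-separated annotation file with a header row and a sample_id column (the path and column names are made up for illustration):

from hail import hadoop_read

# Hypothetical tab-separated file: a header row followed by one sample per line.
with hadoop_read('gs://my-bucket/samples.tsv') as f:
    header = None
    for line in f:
        fields = line.strip().split('\t')
        if header is None:
            header = fields  # first line holds the column names
            continue
        row = dict(zip(header, fields))
        print(row['sample_id'])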

In 0.2, hadoop_read and hadoop_write have been consolidated into a single method, hadoop_open, which takes a mode argument such as 'r' or 'w', much like Python's built-in open:

from hail.utils import hadoop_open

# Mode 'r' opens the file for reading.
with hadoop_open('gs://my-bucket/notes.txt', 'r') as f:
    for line in f:
        print(line.strip())

# Mode 'w' opens the file for writing (result1 and result2 defined elsewhere).
with hadoop_open('gs://my-bucket/notes.txt', 'w') as f:
    f.write('result1: %s\n' % result1)
    f.write('result2: %s\n' % result2)
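
hadoop_open composes with standard-library tools in the same way. A small sketch, assuming Hail 0.2 and a hypothetical results dictionary written out to a bucket as JSON (the path and contents are made up for illustration):

import json
from hail.utils import hadoop_open

# Hypothetical results dictionary; json.dump streams text straight through the handle.
results = {'result1': 0.5, 'result2': 0.25}
with hadoop_open('gs://my-bucket/results.json', 'w') as f:
    json.dump(results, f)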