Out of space when writing VDS

Hi All,
I have a vcf.bgz file of around 300 GB that I would like to convert to a VDS. However, Hail generates very large files during the write. Here are some examples:
hdfs dfs -du -s -h /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt*/*

/applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120347_0054_m_000041_1/part-00041-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120347_0054_m_000052_0/part-00052-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120348_0054_m_000015_3/part-00015-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120348_0054_m_000045_1/part-00045-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120348_0054_m_000048_1/part-00048-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120349_0054_m_000010_3/part-00010-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120350_0054_m_000053_0/part-00053-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120351_0054_m_000018_3/part-00018-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120352_0054_m_000017_3/part-00017-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120352_0054_m_000021_3/part-00021-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120353_0054_m_000022_3/part-00022-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120357_0054_m_000054_0/part-00054-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 0 /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120358_0054_m_000055_0/part-00055-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 0 /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120359_0054_m_000056_0/part-00056-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet

The disk usage over time looked like this:
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 6.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 9.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 9.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 9.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 12.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 39.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 57.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 57.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 57.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 57.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 51.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 57.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 54.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 54.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 6.8 T /applications/genomic

Could you please give some suggestions? Should I change the parquet compression format?
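
If it helps, this is roughly what I had in mind for changing the compression codec. It is only a sketch: it assumes the HailContext constructor exposes a parquet_compression option, and the paths are placeholders, not my real HDFS locations.

from hail import HailContext

# Assumption: HailContext takes a parquet_compression argument
# (e.g. 'snappy', 'gzip', 'uncompressed'); paths are placeholders.
hc = HailContext(parquet_compression='gzip')
hc.import_vcf('/path/to/gt_refine2.vcf.bgz').write('/path/to/gt_refine2.vds')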

Kind regards,

Jiaowei

What is the size of the finished file? If Spark fails while writing some of the chunks, it will leave half-written shards around; they are only removed by the clean-up stage that runs when the write finishes successfully.

Also, what’s the FORMAT field of this VCF look like?

Thanks for the reply. In the end, the Spark job failed very quickly, within about a minute. As for the FORMAT field, here is an example: “GT:AD:DP:GQ:PGT:PID:PL”.

Kind regards,

Jiaowei

Hi again,
My problem seems to be solved. If I set parquet_genotypes=True, everything works well. I am not sure whether this relates to the make_schema function in the source code.
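
For reference, the call that works for me now looks roughly like this. A minimal sketch only: the paths are placeholders, and I am assuming parquet_genotypes is passed to write().

from hail import HailContext

hc = HailContext()
vds = hc.import_vcf('/path/to/gt_refine2.vcf.bgz')
# Writing with parquet_genotypes=True completes without filling up HDFS.
vds.write('/path/to/gt_refine2.vds', parquet_genotypes=True)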

Kind regards,

Jiaowei

This is pretty surprising – I can’t think of any reason why using parquet_genotypes would produce smaller files or fix errors. What was the full pipeline? Was it just

hc.import_vcf('/path/to/vcf').write('/path/to/vds')

Did it involve other operations? We certainly want to fix any problem causing Hail to write out 50T of temporary files!

Thank you for your reply. I have thought about the problem. The reason might be that our VCF file has too many columns, more than 1,000. I previously tried to save all of the data to Parquet, and it was not successful without partitioning the data. For VDS, I am not sure whether having that many columns causes problems.
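
In case partitioning is the issue, this is the kind of thing I was thinking of trying: forcing more partitions at import time. A rough sketch, assuming import_vcf accepts a min_partitions argument; the partition count and paths are placeholders.

from hail import HailContext

hc = HailContext()
# Ask for at least this many partitions when reading the VCF, so no single
# task has to write an enormous Parquet part file.
vds = hc.import_vcf('/path/to/gt_refine2.vcf.bgz', min_partitions=2000)
# parquet_genotypes=True is what worked for me above.
vds.write('/path/to/gt_refine2.vds', parquet_genotypes=True)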

Kind regards,

Jiaowei

By columns, do you mean samples? I don’t think 1000 should be a problem – Konrad K. has used Hail to work with the gnomAD exomes, a ~15T VCF with 200,000 samples!