Out of space when writing VDS

Hi All,
I have a vcf.bgz file of around 300 GB that I would like to convert to a VDS. However, Hail generates very large files during the write. Here are some examples:
hdfs dfs -du -s -h /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt*/*

/applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120347_0054_m_000041_1/part-00041-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120347_0054_m_000052_0/part-00052-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120348_0054_m_000015_3/part-00015-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120348_0054_m_000045_1/part-00045-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120348_0054_m_000048_1/part-00048-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120349_0054_m_000010_3/part-00010-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120350_0054_m_000053_0/part-00053-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120351_0054_m_000018_3/part-00018-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120352_0054_m_000017_3/part-00017-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120352_0054_m_000021_3/part-00021-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120353_0054_m_000022_3/part-00022-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120357_0054_m_000054_0/part-00054-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 0 /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120358_0054_m_000055_0/part-00055-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 0 /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120359_0054_m_000056_0/part-00056-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet

The disk usage over time looked like this:
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 6.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 9.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 9.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 9.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 12.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 39.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 57.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 57.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 57.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 57.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 51.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 57.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 54.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 54.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 6.8 T /applications/genomic

Could you please give some suggestions? Should I change the parquet compression format?
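
If it helps, this is roughly what I had in mind for changing the compression codec. It is only a sketch: it assumes the HailContext constructor exposes a parquet_compression option, and the paths are placeholders, not my real HDFS locations.

from hail import HailContext

# Assumption: HailContext takes a parquet_compression argument
# (e.g. 'snappy', 'gzip', 'uncompressed'); paths are placeholders.
hc = HailContext(parquet_compression='gzip')
hc.import_vcf('/path/to/gt_refine2.vcf.bgz').write('/path/to/gt_refine2.vds')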

Kind regards,

Jiaowei

What is the size of the finished file? If Spark fails while writing some of the chunks, it will leave half-written shards around; they are only removed by the clean-up stage that runs when the write finishes successfully.

Also, what’s the FORMAT field of this VCF look like?

Thanks for the reply. In the end, the Spark job failed very quickly, within about a minute. As for the FORMAT field, here is an example: “GT:AD:DP:GQ:PGT:PID:PL”.

Kind regards,

Jiaowei

Hi again,
My problem seems to be solved. If I set parquet_genotypes=True, everything works well. I am not sure whether this relates to the make_schema function in the source code.
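
For reference, the call that works for me now looks roughly like this. A minimal sketch only: the paths are placeholders, and I am assuming parquet_genotypes is passed to write().

from hail import HailContext

hc = HailContext()
vds = hc.import_vcf('/path/to/gt_refine2.vcf.bgz')
# Writing with parquet_genotypes=True completes without filling up HDFS.
vds.write('/path/to/gt_refine2.vds', parquet_genotypes=True)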

Kind regards,

Jiaowei

This is pretty surprising – I can’t think of any reason why using parquet_genotypes would produce smaller files or fix errors. What was the full pipeline? Was it just

hc.import_vcf('/path/to/vcf').write('/path/to/vds')

Did it involve other operations? We certainly want to fix any problem causing Hail to write out 50T of temporary files!

Thank you for your reply. I have thought about the problem. The reason might be that our VCF file has too many columns, more than 1,000. I previously tried to save all of the data to Parquet, and it was not successful without partitioning the data. For VDS, I am not sure whether having that many columns causes problems.
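
In case partitioning is the issue, this is the kind of thing I was thinking of trying: forcing more partitions at import time. A rough sketch, assuming import_vcf accepts a min_partitions argument; the partition count and paths are placeholders.

from hail import HailContext

hc = HailContext()
# Ask for at least this many partitions when reading the VCF, so no single
# task has to write an enormous Parquet part file.
vds = hc.import_vcf('/path/to/gt_refine2.vcf.bgz', min_partitions=2000)
# parquet_genotypes=True is what worked for me above.
vds.write('/path/to/gt_refine2.vds', parquet_genotypes=True)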

Kind regards,

Jiaowei

By columns, do you mean samples? I don’t think 1000 should be a problem – Konrad K. has used Hail to work with the gnomAD exomes, a ~15T VCF with 200,000 samples!