Hi All,
I have a vcf.bgz file of around 300 GB that I would like to convert to VDS. However, Hail generates very large temporary files while writing. Here are some examples:
hdfs dfs -du -s -h /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt*/*
/applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120347_0054_m_000041_1/part-00041-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120347_0054_m_000052_0/part-00052-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120348_0054_m_000015_3/part-00015-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120348_0054_m_000045_1/part-00045-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120348_0054_m_000048_1/part-00048-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120349_0054_m_000010_3/part-00010-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120350_0054_m_000053_0/part-00053-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120351_0054_m_000018_3/part-00018-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120352_0054_m_000017_3/part-00017-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120352_0054_m_000021_3/part-00021-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120353_0054_m_000022_3/part-00022-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 3 T /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120357_0054_m_000054_0/part-00054-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 0 /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120358_0054_m_000055_0/part-00055-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
0 0 /applications/genomic/data/dataIngestion/gttra/gt_merge/gt_refine2.vds/rdd.parquet/_temporary/0/_temporary/attempt_20170901120359_0054_m_000056_0/part-00056-07c3a274-636d-4fd4-b80e-7a21897fb6c6.snappy.parquet
The disk usage changes over time as follows:
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 6.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 9.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 9.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 9.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 12.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 39.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 57.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 57.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 57.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 57.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 51.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 57.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 54.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 54.8 T /applications/genomic
[jt@poc ~]$ hdfs dfs -du -s -h /applications/genomic
2.3 T 6.8 T /applications/genomic
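To make the readings above easier to compare, here is a minimal Python sketch that parses `hdfs dfs -du -h` lines into bytes. It assumes the first column is the logical size and the second is the disk space consumed across all replicas, and that the unit letters are binary multiples (powers of 1024). The helper names `to_bytes` and `parse_du_line` are hypothetical, not part of Hail or Hadoop.

```python
import re

# Assumption: `-h` prints binary units (K/M/G/T/P as powers of 1024).
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}

# Two "<number> [unit]" fields followed by an absolute path.
_LINE = re.compile(
    r"^\s*([\d.]+)(?:\s([KMGTP]))?\s+([\d.]+)(?:\s([KMGTP]))?\s+(/\S*)\s*$"
)


def to_bytes(number, unit):
    """Convert a human-readable value such as ('2.3', 'T') to whole bytes."""
    return int(round(float(number) * _UNITS[unit or ""]))


def parse_du_line(line):
    """Return (size_bytes, consumed_bytes, path) for one `du -h` output line."""
    match = _LINE.match(line)
    if match is None:
        raise ValueError("unrecognised du line: %r" % line)
    size, size_unit, used, used_unit, path = match.groups()
    return to_bytes(size, size_unit), to_bytes(used, used_unit), path


if __name__ == "__main__":
    # Lowest and highest readings copied from the output above.
    readings = [
        "2.3 T 6.8 T /applications/genomic",
        "2.3 T 57.8 T /applications/genomic",
    ]
    consumed = [parse_du_line(r)[1] for r in readings]
    extra_tib = (max(consumed) - min(consumed)) / 1024**4
    print("extra space consumed during the write: %.1f TiB" % extra_tib)
```

Under these assumptions, the first column stays at 2.3 T for the whole run and only the consumed-space column grows, which matches the per-file listing where the in-progress part files report a size of 0 but 3 T consumed.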
Could you please give me some suggestions? Should I change the Parquet compression format?
Kind regards,
Jiaowei