Hi team!
Asking for some insight/tips.
I have a large collection of gVCF files. I am testing on 10,000 first, with a view to scaling up to 50,000 and then 100,000 participants in due time. I wish to load the single-sample gVCFs into a VDS, and I wonder if there is a guideline on an optimal batch size.
My first approach would be to create VDSs in batches of 1,000 participants, then combine the VDSs in pairs, following a binary-tree approach.
Some rationale:
- Batches of 1,000 worked well in other processing (IGG joint calling).
- The binary-tree merging was efficient for building a 10,000-participant dense MatrixTable from gVCFs using an old version of Hail.
In terms of snippets:
Import gVCFs into a VDS
# From a list of 10,000 gVCF S3 paths
ls_gvcf
# Params
batch = 0
step = 1000
# Combine gVCFs (VDS_PREFIX and VDS_TMP are S3 prefixes defined elsewhere)
combiner = hl.vds.new_combiner(
    output_path=f'{VDS_PREFIX}/SG10K_Health_{batch}.vds',
    temp_path=f'{VDS_TMP}/',
    gvcf_paths=ls_gvcf[batch:batch + step],
    use_genome_default_intervals=True,
)
combiner.run()
# Eventually loop in steps of 1,000 until all the batch VDSs are created (see sketch below)
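For completeness, a minimal sketch of that loop, assuming ls_gvcf holds all 10,000 paths and VDS_PREFIX / VDS_TMP point at the output and temp locations:

import hail as hl

step = 1000
# One batch VDS per slice of 1,000 gVCFs; the slice start doubles as the batch label
for batch in range(0, len(ls_gvcf), step):
    combiner = hl.vds.new_combiner(
        output_path=f'{VDS_PREFIX}/SG10K_Health_{batch}.vds',
        temp_path=f'{VDS_TMP}/',
        gvcf_paths=ls_gvcf[batch:batch + step],
        use_genome_default_intervals=True,
    )
    combiner.run()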
Rolling union
Here is a snippet I used for the union of MatrixTables; I would update the code to combine VDSs instead. The idea is, from a list of VDSs, to combine 1+2, 3+4, 5+6… in a rolling union until the last step merges all the data into one file (a sketch of the VDS version of the pairwise step follows after the snippet).
# Create a manifest
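# One hypothetical way to build this manifest (ls_batch_keys and bucket are
# assumed to be defined elsewhere; names are illustrative):
ht_manifest = hl.Table.parallelize(
    [hl.Struct(bucket=bucket, key=key, batch=i + 1)
     for i, key in enumerate(ls_batch_keys)],
    hl.tstruct(bucket=hl.tstr, key=hl.tstr, batch=hl.tint32),
)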
ht_manifest.show()
+-------------------+-----------------------------------------+-------+
| bucket | key | batch |
+-------------------+-----------------------------------------+-------+
| str | str | int32 |
+-------------------+-----------------------------------------+-------+
| "precise-scratch" | "hebrardms/SG100K_r2.0/igg/batch-1.mt" | 1 |
| "precise-scratch" | "hebrardms/SG100K_r2.0/igg/batch-2.mt" | 2 |
| "precise-scratch" | "hebrardms/SG100K_r2.0/igg/batch-3.mt" | 3 |
| "precise-scratch" | "hebrardms/SG100K_r2.0/igg/batch-4.mt" | 4 |
...
###
# List of paths
manifest = ht_manifest.collect()
# Get last index, add 1 to init the loop
last_idx = manifest[-1].batch + 1
n = last_idx
# bucket, union_prefix and bt_bucket (a boto3 Bucket resource) are defined elsewhere
while len(manifest) > 1:
    try:
        # Pop A
        m_A = manifest.pop(0)
        mt_A = hl.read_matrix_table(f's3://{m_A.bucket}/{m_A.key}')
        idx_A = m_A.batch
        try:
            # Pop B
            m_B = manifest.pop(0)
            mt_B = hl.read_matrix_table(f's3://{m_B.bucket}/{m_B.key}')
            idx_B = m_B.batch
            # Union
            mt_U = mt_A.union_cols(mt_B, row_join_type='outer')
            idx_U = n
            path_U = f'{union_prefix}/union-{idx_U}_{idx_A}-{idx_B}.mt'
            m_U = hl.Struct(bucket=bucket, key=path_U, batch=idx_U)
            # Save path
            manifest.append(m_U)
            # Write
            mt_U.write(f's3://{m_U.bucket}/{m_U.key}', overwrite=True)
            # Delete source files if not in the initial manifest
            if idx_A >= last_idx:
                bt_bucket.objects.filter(Prefix=m_A.key).delete()
            if idx_B >= last_idx:
                bt_bucket.objects.filter(Prefix=m_B.key).delete()
            # Next
            n += 1
        except Exception:
            print('Cannot read ' + m_B.key)
            break
    except Exception:
        print('Cannot read ' + m_A.key)
        break
print(f'input: {last_idx:,} - iterations: {n - 1:,}')
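For the VDS version of the pairwise step, my understanding is that hl.vds.new_combiner can also take existing VDSs via vds_paths, so the union_cols call above would be replaced by another combiner run. A minimal sketch, assuming vds_path_A and vds_path_B are two batch VDS paths and UNION_PREFIX / VDS_TMP are S3 prefixes (all names here are illustrative):

# Pairwise VDS merge: combine two batch VDSs into one
combiner = hl.vds.new_combiner(
    output_path=f'{UNION_PREFIX}/union_1-2.vds',
    temp_path=f'{VDS_TMP}/',
    vds_paths=[vds_path_A, vds_path_B],
)
combiner.run()

If that is right, vds_paths can also take more than two VDSs at once, so a wider fan-in than strict pairs might be worth testing.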
I am curious to know if that approach makes sense, or if there are optimisations specific to VDS.