Large gVCF into VDS

Hi team!

Asking for some insight / tips

I have a large collection of gVCF files. I am testing on the first 10,000, with a view to scaling up to 50,000 and 100,000 participants in due time. I wish to load the single-sample gVCFs into a VDS, and I wonder if there is a guideline in terms of an optimal batch size.

My first approach would be to create VDSs in batches of 1,000 participants, then combine the VDSs pairwise, following a binary-tree approach.
Some rationale:

  • Batches of 1,000 worked well in other processing (IGG joint calling)
  • Binary-tree merging was efficient for creating a 10,000-participant dense MatrixTable from gVCFs with an older version of Hail

In terms of snippets:

Import gVCF into VDS

import hail as hl

# From a list of 10,000 gVCF S3 paths
ls_gvcf = [...]

# Params
batch = 0
step = 1000

# Combine gVCFs into one batch VDS
combiner = hl.vds.new_combiner(
    output_path=f'{VDS_PREFIX}/SG10K_Health_{batch}.vds',  # VDS_PREFIX: output S3 prefix
    temp_path=f'{VDS_TMP}/',                               # VDS_TMP: scratch S3 prefix
    gvcf_paths=ls_gvcf[batch:(batch + step)],
    use_genome_default_intervals=True,
)

combiner.run()

# Then loop in steps of 1,000 until all the batch VDSs are created
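
A minimal sketch of that outer loop, assuming ls_gvcf, step, VDS_PREFIX and VDS_TMP are the same placeholders as above:

for batch, start in enumerate(range(0, len(ls_gvcf), step)):
    combiner = hl.vds.new_combiner(
        output_path=f'{VDS_PREFIX}/SG10K_Health_{batch}.vds',
        temp_path=f'{VDS_TMP}/',
        gvcf_paths=ls_gvcf[start:start + step],
        use_genome_default_intervals=True,
    )
    combiner.run()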

Rolling union
Here is a snippet I previously used for the union of MatrixTables; I would update the code to combine VDSs. The idea is, starting from a list of VDSs, to combine 1+2, 3+4, 5+6… in a rolling union until the last step merges all the data into one file (a VDS version of the merge step is sketched after the snippet).

# Create a manifest
ht_manifest.show()
+-------------------+-----------------------------------------+-------+
| bucket            | key                                     | batch |
+-------------------+-----------------------------------------+-------+
| str               | str                                     | int32 |
+-------------------+-----------------------------------------+-------+
| "precise-scratch" | "hebrardms/SG100K_r2.0/igg/batch-1.mt"  |     1 |
| "precise-scratch" | "hebrardms/SG100K_r2.0/igg/batch-2.mt"  |     2 |
| "precise-scratch" | "hebrardms/SG100K_r2.0/igg/batch-3.mt"  |     3 |
| "precise-scratch" | "hebrardms/SG100K_r2.0/igg/batch-4.mt"  |     4 |
...
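
For context, a manifest like this could be built along these lines (illustrative only; the paths and batch range are placeholders):

rows = [
    hl.Struct(bucket='precise-scratch',
              key=f'hebrardms/SG100K_r2.0/igg/batch-{i}.mt',
              batch=i)
    for i in range(1, 11)
]
ht_manifest = hl.Table.parallelize(
    rows, hl.tstruct(bucket=hl.tstr, key=hl.tstr, batch=hl.tint32)
)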

###
# List of manifest entries
manifest = ht_manifest.collect()
# Get the last batch index; add 1 to initialise the union counter
last_idx = manifest[-1].batch + 1
n = last_idx

# bucket, union_prefix and bt_bucket (a boto3 Bucket resource) are defined elsewhere
while len(manifest) > 1:
    try:
        # Pop A
        m_A = manifest.pop(0)
        mt_A = hl.read_matrix_table(f's3://{m_A.bucket}/{m_A.key}')
        idx_A = m_A.batch

        try:
            # Pop B
            m_B = manifest.pop(0)
            mt_B = hl.read_matrix_table(f's3://{m_B.bucket}/{m_B.key}')
            idx_B = m_B.batch

            # Union
            mt_U = mt_A.union_cols(mt_B, row_join_type='outer')
            idx_U = n
            path_U = f'{union_prefix}/union-{idx_U}_{idx_A}-{idx_B}.mt'
            m_U = hl.Struct(bucket=bucket, key=path_U, batch=idx_U)

            # Save path
            manifest.append(m_U)
            # Write
            mt_U.write(f's3://{m_U.bucket}/{m_U.key}', overwrite=True)

            # Delete source files if not in the initial manifest
            if idx_A >= last_idx:
                bt_bucket.objects.filter(Prefix=m_A.key).delete()
            if idx_B >= last_idx:
                bt_bucket.objects.filter(Prefix=m_B.key).delete()

            # Next
            n += 1
        except Exception:
            print('Cannot process ' + m_B.key)
            break
    except Exception:
        print('Cannot read ' + m_A.key)
        break

print(f'input: {last_idx - 1:,} - iterations: {n - last_idx:,}')
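
If the batch outputs were VDSs rather than MatrixTables, the pairwise merge step could go through the combiner instead of union_cols; vds_paths is the new_combiner argument for merging existing VDSs. A minimal sketch of one merge iteration under that assumption (same manifest structure, with keys pointing at .vds paths):

# Merge two batch VDSs into one, analogous to the union_cols step above
path_A = f's3://{m_A.bucket}/{m_A.key}'
path_B = f's3://{m_B.bucket}/{m_B.key}'
merge = hl.vds.new_combiner(
    output_path=f's3://{bucket}/{union_prefix}/union-{idx_U}_{idx_A}-{idx_B}.vds',
    temp_path=f'{VDS_TMP}/',
    vds_paths=[path_A, path_B],
)
merge.run()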

I am curious to know if that approach makes sense, or if there is an optimisation for VDS.

You should run all of your GVCFs into a single combiner job. It’s designed to handle very large numbers of GVCFs at a time. It appropriately subdivides the work.

The following should work for your data.

ls_gvcf = ...
combiner = hl.vds.new_combiner(
    output_path=f'{VDS_PREFIX}/SG10K_Health.vds',
    temp_path=f'{VDS_TMP}/',
    gvcf_paths=ls_gvcf,
    use_genome_default_intervals=True,
)
combiner.run()
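
One practical note for a job of this size: the combiner can persist its plan so an interrupted run can be resumed. A minimal sketch, assuming the save_path argument of new_combiner and hl.vds.load_combiner behave as documented for recent Hail versions (the plan file name is a placeholder):

combiner = hl.vds.new_combiner(
    output_path=f'{VDS_PREFIX}/SG10K_Health.vds',
    temp_path=f'{VDS_TMP}/',
    save_path=f'{VDS_TMP}/combiner_plan.json',  # hypothetical plan location
    gvcf_paths=ls_gvcf,
    use_genome_default_intervals=True,
)
combiner.run()

# After an interruption, resume from the saved plan
combiner = hl.vds.load_combiner(f'{VDS_TMP}/combiner_plan.json')
combiner.run()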

Thanks Chris for the insight.

I encountered an issue:

AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState;)

Note that the gVCFs are stored in cold storage; I restored them prior to running the compute.

/!\ I did not restore the indexes: the *.gvcf.gz.tbi files are still in cold storage. I hope the issue is due to this and will be solved as soon as I restore the .tbi files.
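
For reference, a minimal boto3 sketch for requesting the restore of the .tbi objects (bucket name, prefix, restore duration and tier are placeholders):

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-gvcf-bucket')  # placeholder bucket

# Request restoration of every .tbi object under the gVCF prefix
for obj in bucket.objects.filter(Prefix='path/to/gvcfs/'):  # placeholder prefix
    if obj.key.endswith('.gvcf.gz.tbi'):
        obj.Object().restore_object(
            RestoreRequest={'Days': 7, 'GlacierJobParameters': {'Tier': 'Bulk'}}
        )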

I can confirm that restoring the *.tbi files solved the InvalidObjectState issue.