I tried several different approaches for my project, first with MatrixTable, but the performance was bad.
Now I am trying to work only with Tables. My pipeline has 2 steps.
step 1: ingest the VCFs, i.e. import the VCF files into Tables and store them in a folder (one Table per VCF)
step 2: query the Tables to get the GT, GQ or DP fields for a given set of variants and samples:
code.txt (1.6 KB)
Can you look at this code snippet and give me some advice on how I can improve/optimise it? I am disappointed by its performance.
note: I tried adding checkpoints or .persist(), but the difference wasn’t significant; maybe they were in the wrong place…
Thanks in advance!