Possible vcf_combiner issue

Hey all,

My inputs variable is just a list of strings giving the full paths to 49,953 gVCF files, tmp_path is just a folder path string, and out_file is also a string path. I am just trying to get a test sparse matrix table for chromosome 21, since it is the smallest. I am running the following, and every time I get an error:

hl.experimental.run_combiner(
    sample_paths=inputs,
    out_file=output_file,
    tmp_path=temp_bucket,
    intervals=[hl.Interval(hl.Locus("21", 1, reference_genome='GRCh38'),
                           hl.Locus("21", 46709983, reference_genome='GRCh38'),
                           includes_end=True)],
    reference_genome='GRCh38')

Error:

AttributeError: StructExpression instance has no field, method, or property 'PGT'
    Did you mean:
        Data field: 'GT'
        StructExpression inherited method: 'get'

My files do not in fact have PGT in them, only GT:DP:MIN_DP:GQ, so I can see why it errors out, but this is not an issue in GLnexus, for example. Is there an option so that it does not expect fields that are not defined in the header? Or is my error something different entirely?

Thanks,

Jim

The VCF combiner is still new functionality and was built to work for recent GATK GVCFs. It shouldn’t be hard to fix to support these files, though.

What’s the full FORMAT header for your file?

Hey Tim thanks for getting back to me.


##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotypes of reference and alternative alleles in order listed.",Source="weCall",Version="1.1.2">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Probabilistic allelic depths for the ref and alt alleles in the order listed (i.e. INFO::VC split out by sample).",Source="weCall",Version="1.1.2">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of reads overlapping the variant site (i.e. INFO::DP split out by sample). For reference calls the average depth (rounded to the nearest integer) over the region is reported.",Source="weCall",Version="1.1.2">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Phred-scaled genotype quality (i.e. posterior probability that the genotype call is incorrect).",Source="weCall",Version="1.1.2">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification.",Source="weCall",Version="1.1.2">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Probabilistic variant allelic frequencies for each alt allele (FORMAT::AD / FORMAT::DP).",Source="weCall",Version="1.1.2">
##FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set id.",Source="weCall",Version="1.1.2">
##FORMAT=<ID=PQ,Number=1,Type=Integer,Description="Phred-scaled phase quality (i.e. posterior probability that the phasing is incorrect).",Source="weCall",Version="1.1.2">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum read coverage observed within the reference block.",Source="weCall",Version="1.1.2">

My header is as above; it should be the same for all 49k files (I would hope, anyway; they're all from UK Biobank).
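For what it's worth, a quick way to double-check which FORMAT IDs a header declares, in plain Python (illustrative snippet; the header text here is abbreviated, and in practice you'd read the `##` lines from the top of each gVCF):

```python
import re

# Abbreviated header for illustration; real headers come from the '##' lines
# at the top of a gVCF (e.g. read via gzip.open on a .vcf.gz).
header = """\
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Depth">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Min depth in block">
"""

# Pull the ID out of each ##FORMAT meta-line.
format_ids = re.findall(r'##FORMAT=<ID=([^,>]+)', header)
print(format_ids)  # ['GT', 'AD', 'DP', 'GQ', 'MIN_DP']
```

Running this over all 49k headers would confirm none of them declares PGT.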

OK, thanks, I’ll see if I can mock up a file that looks like this for testing!

Sure, I appreciate it. Should I use combine_gvcfs or something in the meantime? Is there a simple substitute for creating a sparse matrix table? I should also note that these gVCFs from UK Biobank were created using weCall, not GATK.

The problem is inside the combine logic in combine_gvcfs, so you’ll hit the same exception if you try that. We need to parameterize the set of FORMAT fields we’re aware of that need special handling, and just pass the other FORMAT fields through.
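Sketched in plain Python, the idea would look something like this (the `SPECIAL` set below is a guess at which fields get rewritten during merging, not the actual list in the Hail source):

```python
# Hypothetical sketch: the combiner keeps a known set of FORMAT fields it
# rewrites while merging (e.g. GT -> LGT, AD -> LAD) and would pass any other
# declared fields through untouched instead of assuming they exist.
SPECIAL = {'GT', 'AD', 'PL', 'PGT', 'PID'}  # assumed, not the real list

def partition_format_fields(declared):
    """Split declared FORMAT IDs into specially-handled vs pass-through."""
    special = [f for f in declared if f in SPECIAL]
    passthrough = [f for f in declared if f not in SPECIAL]
    return special, passthrough

# weCall files declare no PGT/PID, so those simply never show up here:
special, passthrough = partition_format_fields(
    ['GT', 'AD', 'DP', 'GQ', 'PL', 'VAF', 'PS', 'PQ', 'MIN_DP'])
print(special)      # ['GT', 'AD', 'PL']
print(passthrough)  # ['DP', 'GQ', 'VAF', 'PS', 'PQ', 'MIN_DP']
```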

Yeah, I noticed that it looks specifically for an entry e to have a PGT field (e.PGT), though I’m assuming a simple try/except would break a lot of other things, haha.

This change should fix it:

I haven’t added a test for your specific files, but this passes the tests for ours and makes the import significantly more flexible.

Alright, thanks. I’ll git pull this test branch then. I was running the version from pip earlier.

Jim Havrilla

So I tried running the same command as above, but on an even smaller locus, ACE2:


hl.experimental.run_combiner(
    sample_paths=inputs,
    out_file=output_file,
    tmp_path=temp_bucket,
    intervals=intervals,
    reference_genome='GRCh38',
    overwrite=True,
    contig_recoding=contig_recoding)

And it looks like it’s getting through the first 5 stages based on the log file, but then it hangs for a really long time and the log repeatedly prints:


2020-06-09 23:30:26 NettyRpcEnv: WARN: Ignored failure: java.util.concurrent.TimeoutException: Cannot receive any reply from compute-0-9.local:39381 in 10 seconds

2020-06-09 23:30:28 Executor: WARN: Issue communicating with driver in heartbeater

org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval

Over and over again. I’m not sure if I should just let it run for a very long time. At your advice I am running this on a qlogin node with a lot of cores and memory: qlogin -pe smp 64 -l s_vmem=2G. Previously I used h_vmem=2G and kept hitting Java heap errors, but I found that can happen when running Java with a hard h_vmem limit instead of a soft s_vmem one. Now it just kind of hangs for a long time. Does choosing a small interval like


intervals=[hl.Interval(hl.Locus("chrX", 15360138, reference_genome='GRCh38'), hl.Locus("chrX", 15802945, reference_genome='GRCh38'), includes_end=True)]

not make it run much faster? It’s only grabbing that one small region from so many files. Is it creating the whole gVCF first and then shrinking it to that interval? If so, I should just make the whole gVCF into the matrix table. I’ll try running it overnight and see if it gets anywhere; in parallel I’m trying GLnexus again as well.

This does finish on a small test set. In that case, after running:


mt = hl.methods.read_matrix_table(output_file)
mt2 = hl.experimental.densify(mt)
hl.export_vcf(mt2, 'ace2.vcf.bgz')

I receive the error:


ValueError: Method 'export_vcf' requires row key to be two fields 'locus' (type 'locus<any>') and 'alleles' (type 'array<str>')

Found:

'locus': locus<GRCh38>

Not quite sure what is going on there either. I don’t know why this combination did not add the alleles field necessary to export the MatrixTable as a VCF.

Let me know your thoughts. And thanks for sticking with it.

Best,

Jim

This should definitely make it run faster, yes. What was the last Hail info message that was printed? I added a bunch of printouts about progress in the combiner recently, so we should have decent visibility into what it’s doing.

export_vcf should probably be relaxed to just require a locus, but for now you should pass the key_by_locus_and_alleles=True flag on run_combiner, which was added roughly for this purpose.

Yeah, so I tried to merge all 49k gVCFs for chr21, and I got Error summary: OutOfMemoryError: Java heap space. Now, I know you said it’s core- and not memory-dependent, but I ran this on a machine with 32 cores and 1 TB of RAM, so this seems like a pretty big problem.

I did pass that into run_combiner now, thanks for the heads up.

Ah, when Spark starts in local mode, it allocates a very small amount of memory (<1G) by default. You can set this using an environment variable; instructions here:
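For example (this is the standard Spark submit-args variable; the 900g figure is illustrative, adjust it to your machine and leave some headroom below the full 1 TB):

```shell
# Set before starting Python/Hail so the local-mode Spark driver gets real memory.
export PYSPARK_SUBMIT_ARGS="--driver-memory 900g pyspark-shell"
```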

I was about to upload the log and script, but it sounds like that might do it. So I’ll just tell it to use the 1 TB.

So I ran it for chr21 with all the above settings we discussed, and this time it just stopped running after generating the 100 matrix tables of the first stage…

I’ve uploaded the .log file and the .py script in a zip file here:
https://github.com/jimhavrilla/images/files/4769700/hail-20200611-2016-0.2.44-6cfa355a1954.zip

After it finished stage 99 (around line 120484 of the log), it just kind of didn’t keep going; it kept printing this over and over again until, I guess, it stopped:

(TableMapRows
  (CastMatrixToTable "__entries" "__cols"
    (MatrixLiteral))
  (ApplyIR __uid_4 () Struct{locus:Locus(GRCh38),alleles:Array[String],rsid:String,__entries:Array[Struct{LA:Array[Int32],LGT:Call,LAD:Array[Int32],LPL:Array[Int32],RGQ:Int32,END:Int32,gvcf_info:Struct{PP:Array[Int32],DPR:Int32,DPF:Int32,VC:Array[Int32],VCR:Array[Int32],VCF:Array[Int32],MQ:Array[Float64],ABPV:Array[Float64],SBPV:Array[Float64],QD:Array[Float64],BR:Array[Float64],BEG:Int32,LEN:Int32},DP:Int32,GQ:Int32,VAF:Array[Float64],PS:String,PQ:Int32,MIN_DP:Int32}]}
    (Ref row)))

On this note, is there really no way to just perform SKAT and other analyses on all of the matrix tables at once? Or must they be merged? Merging them all into one is proving to be the most difficult step.

This is to be expected. It’s a transpose of data stored one-file-per-sample to a row-major representation that makes it possible to do things like GWAS.
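As a toy illustration of that transpose (plain Python, nothing Hail-specific):

```python
# Each gVCF is effectively one column: a per-sample map from locus to call.
sample_a = {1000: '0/0', 1005: '0/1'}
sample_b = {1000: '0/1', 1005: '1/1'}

# The combiner's output is row-major: one row per locus with an entry per
# sample, which is the layout row-wise analyses like GWAS need.
loci = sorted(set(sample_a) | set(sample_b))
rows = {locus: [sample_a.get(locus), sample_b.get(locus)] for locus in loci}
print(rows)  # {1000: ['0/0', '0/1'], 1005: ['0/1', '1/1']}
```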

I haven’t yet had a chance to look into the log above. How long did you wait? Some of the stuff in the vcf combiner requires a bit of wall time even if it’s not using much CPU in those phases.

I let it run all day. It just stopped running, as though it finished, but only those files were generated. It was strange, and the log was pretty difficult for me to make sense of; I gave my best interpretation of what was going on. The only files it seemed to generate were in tmp, and only the stage-1 MTs 1-100.

Any updates? Would it be a better idea if I manually ran vcf_combiner on groups of, say, 1000 samples at a time, and then combined their results? Can I export gVCFs, or just regular VCFs?
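Roughly the batching I have in mind, as a sketch (the batch size and the merge step itself are placeholders here; I don't know the combiner's real branch factor):

```python
def merge_stage_sizes(n_samples, batch_size=1000):
    """Number of intermediate tables remaining after each round of batched merging."""
    sizes = []
    n = n_samples
    while n > 1:
        n = -(-n // batch_size)  # ceiling division: full batches plus a remainder
        sizes.append(n)
    return sizes

# 49,953 samples merged 1000 at a time: 50 intermediates, then 1 final table.
print(merge_stage_sizes(49953))  # [50, 1]
```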

This is quite surprising. I think the main problem is we don’t have a ton of visibility into what it’s doing here. I’m worried that the IO system on the cluster you’re running on is quite different from the ones Hail is commonly run on (laptops or cloud object stores) and some operation is really inefficient there.

Something you could try is running it again, waiting the ~20 minutes it takes to get to the place where it stalls, then sending a SIGINT to the process to get a stack trace of where it’s at.