VCF format for hail in seqr

Hi all, could you possibly help a newbie get my VCFs into the right format to work with hail?

I’m hoping to use seqr to analyse my own WGS due to severe health problems, but am totally new to bioinformatics (and linux) and can’t work out how to adapt my VCFs.

My sequence is from 10x Chromium and the file I’d most like to use is a VCFv4.1 from LongRanger, since that’s best placed to take advantage of the linked reads. If I try to load it to seqr I get:

HailException: invalid PL field `0,2510,18100’: expected 6 values, but got 3.

which is confusing me because the VCFv4.2 spec says PL field is supposed to have 3 values. Could something else in my file be telling it to expect 6?

I also have a VCFv4.2 from GATK pipelines, but running that in seqr throws up lots of this warning:

Hail: WARN: Struct{allele_num:Int,amino_acids:String,biotype:String,canonical:Int,ccds:String,cdna_start:Int,cdna_end:Int,cds_end:Int,cds_start:Int,codons:String,consequence_terms:Array[String],distance:Int,domains:Array[Struct{db:String,name:String}],exon:String,flags:String,gene_id:String,gene_pheno:Int,gene_symbol:String,gene_symbol_source:String,hgnc_id:String,hgvsc:String,hgvsp:String,hgvs_offset:Int,impact:String,intron:String,lof:String,lof_flags:String,lof_filter:String,lof_info:String,minimised:Int,polyphen_prediction:String,polyphen_score:Double,protein_end:Int,protein_start:Int,protein_id:String,sift_prediction:String,sift_score:Double,strand:Int,swissprot:String,transcript_id:String,trembl:String,uniparc:String,variant_allele:String} has no field tsl at .transcript_consequences.

then eventually just stops, no index loaded into elasticsearch.

Any help would be much appreciated.

HailException: invalid PL field `0,2510,18100’: expected 6 values, but got 3.

The PL field has number “G”, which means “one value for each possible genotype”. The number of possible genotypes is (N * (N + 1)) / 2, where N is the number of total alleles (including the reference). So if the ALT field of the VCF has only one allele (no comma), then there should be 3 values (1 alt + 1 reference = 2, 2 * 3 / 2 = 3). If there are 2 alternates, there should be 6 values. I assume this is what happened!

Which version of Hail are you using, 0.1 or 0.2? I’d really recommend using 0.2, especially since we plan to deprecate 0.1 in less than a month! My initial investigation says that your “Invalid PL field” error wouldn’t happen in 0.2.

Hi tpoterba, thanks for the quick response. Ah, that would explain it! I’m using whatever version of hail is currently in seqr… I think 0.1? I shall look into it and see if there’s any way to move over to 0.2. If not is there any other way around it?

Let me poke around the Seqr repo for a moment and report back. I know that the Seqr team hasn’t moved over to 0.2 yet, though, but I don’t know how married Seqr is to a specific Hail version (if it has hail scripts, etc)

OK, I haven’t found anything specific that suggests it’s impossible to load data into Seqr from Hail 0.2.

@bw2 is that right? for example, the load_dataset_to_es.py script referenced in the README doesn’t seem to be part of the repo.

Thank you - so maybe best to install Hail locally you think?

I think you should be able to run both Hail and Seqr locally, Ben can confirm about the Seqr part. Depending on your machine size, things could take a while though – we work with datasets >10,000 whole genomes, but if you have just one, the variant annotations (predictions of protein functional consequence, for instance) are still big.

Thank you. I’ve got Seqr running locally but haven’t been able to upload any data yet - just waiting for the next update which should sort out a few probs I was having with CPU workload when loading the example project.

Lol I can’t even imagine working with 10,000 whole genomes! Slow is OK, I’m really just grateful to find something I can use. I realise it may be rather overkill to use Seqr for my one genome (or three when we get the data back for my parents) - but I’ve tried everything I could think of for months and found the others are all either too user-unfriendly, can’t handle my files, or only work with GRCh37 - and mine’s 38.

Do you have any idea what the problem was with my other VCF, the GATK one where it just gives lots of Struct warnings then stops? Or is it supposed to do that and I’ve simply missed that there’s another step I need to take?

I don’t really know how to interpret that Seqr error, sorry :frowning: Looks like it’s coming from VEP inside Seqr?

Note that it’ll only be slow because you’ll be held back by lack of available compute – people here are working with 10,000 whole genomes on jupyter notebooks backed by ~1500 core clusters on Google Cloud Platform.

This might be another option for you – you can use a ~100 core cluster for about $2.50 per hour, which will make your analysis MUCH faster than using Hail on a few cores. Hail is pretty flexible, and though it’s not an indexed search engine, it means you can ask all kinds of questions about your data. We also have some decent tooling for working on Google cloud: https://github.com/Nealelab/cloudtools

It sounds like you’ve done your research, but If I were in your position, I would use the cloud, annotate my variant data with the Variant Effect Predictor (VEP), and look for gene knockout mutations that aren’t found in public databases of healthy individuals like gnomAD. When you get your parents’ data your analysis gets MUCH easier – you look for either inherited recessive mutations (both parents carrier Aa, you aa), or de novo mutations. If you find something there, it’ll be a great candidate.

Ah OK, I thought the error was still while it was in Hail. Probably better off sticking to the LongRanger files anyway.

I hadn’t thought of using a cloud cluster, definitely worth considering. It may be particularly helpful further down the line if/when I start having to look for needles in haystacks. If I’m lucky I may find I have known mutations for the most strongly suspected conditions, but it’s quite likely I’ll have to look at novel variants or less well described gene-disease associations. Of course with the Chromium data there’s the possibility of phased haplotype analysis…

Thanks again for your help, I really appreciate you taking the time.

that error is in Hail, but I don’t know how Seqr is using Hail in that case. It’s annotating with VEP but I’m not sure how it’s configured

And yeah, the phasing could certainly help, but I think you won’t need it until the very end – the kind of thing it can tell you is whether, for instance, two knockout mutations in the same gene are on the same chromosome (in which case they function as one knockout) or different (double knockout). But you can easily calculate whether you have two knockout mutations in any gene without phased data!

You can also phase using short read WGS data if you have both parents using imputation :slight_smile:

Maybe Ben will know about the Seqr-Hail part. They seem to be constantly working hard to improve Seqr so I imagine a move to incorporate Hail 0.2 is in the pipeline anyway.

True, but I won’t have my parents’ data for a while and I’m very keen to get on with the analysis :slight_smile:

Most of the conditions I’m looking at first are autosomal recessive, commonly compound heterozygous with several hundred pathogenic mutations discovered so far, many of which are partial LOF rather than total knockout - so phasing could be quite helpful. It seems possible that I have two (or more) distinct genetic conditions, each of which may be relatively mild in its own right but which combined cause the more severe and complex clinical presentation. It’s certainly interesting trying to find out!

Yeah, they’ll switch over to 0.2 after we deprecate 0.1. It hasn’t made a huge amount of sense yet – 0.1 is stable and working well for them.

I haven’t really used Seqr - do you know if you can ask this kind of question in Seqr? I know you can ask it in Hail (annotate functional consequence, group by gene, aggregate to compute knockout variants per gene, show genes with >1 knockout)

I’m not sure - there’s not much publicly available info on Seqr - but from the few things I have seen I think it may be possible. I know they’re using it for rare disease gene discovery, so I think it’s pretty powerful in that way. For me the “intuitive browser-based system” part is very attractive - I have some knowledge of medical genetics, but until very recently had only just about enough experience of computing to know that python can be something other than a reptile. It’s a very steep learning curve - but fascinating.

@jules-o Have you considered Seave ? It is designed for clinical cases research/diagnoses at Kinghorn Centre for Clinical Genomics. It has very limited scalability beyond single VCF/trio. Automates a lot in diagnostic process and has the latest annotations updates. Browser-based and can be installed locally. Never used it mysef, so can’t tell about other limitations, but it is widely used at KCCG and might worth to try in your case. You can assess its functionality with a demo dataset on https://www.seave.bio


1 Like

Ooh that sounds hopeful - thank you, I shall have a look.

Ah, unfortunately Seave only supports GRCh37. I have tried lifting over but there were too many calls that couldn’t be lifted for my liking. I think, having got this far, I shall keep trying with Seqr/Hail and see what I can do.

ahhh, bummer. Yeah, sounds good.