command: vds.impute_sex(maf_threshold=0.05, include_par=False, female_threshold=0.2, male_threshold=0.8, pop_freq=None)
resulting in imputesex.isFemale majority is none, and rest are true/false, which is not correct according to sample labeling. Any reason causes it? Thanks.
None indicates that the inbreeding coefficient lies somewhere between 0.2 and 0.8. It might be a good idea to plot a histogram of the F statistic and see what that looks like. This could indicate poor data quality, or might be expected if you have very few X chromosome sites.
Also, if your dataset isn’t GRCh37, that could explain it – 0.1 is built for that reference genome and won’t process others correctly. 0.2 (
https://www.hail.is/docs/devel) solves that problem.
Does hg19 or hs37d5 or b37 considered as your GRCh37 definition?
Hail 0.1 works well with contigs “1”, “2”, …“X”, “Y”.
If the contigs are named “chr1”, “chr2”, … “chrX”, bad things happen!
I’d really recommending switching to 0.2 if possible! in 0.2 the reference genome is parameterized:
Hail 0.1 did not work well with genetic datasets that
were not aligned to b37. The contig sort order was hard-coded as 1-22, X, Y, MT. Any additional contigs not in this list were sorted
alphabetically. This caused datasets with contigs that contained chr
prefixes such as hg38 to be shuffled on import. In addition, the sex
chromosomes and mitochondrial contig name were hardcoded to be X, Y,
and MT and the coordinates of the pseudoautosomal region on the X and Y
chromosomes were hardcoded f…
I see, then that’s not an issue for this dataset, I do see majority of f score is around .5. Does that mean impute sex didn’t work for my dataset?
F around 0.5 means that the sample has 50% of the expected number of heterozygous sites. This does indicate a problem either with the data or the way it’s represented.
So what are the potential causes (or how do I troubleshoot in such case)?
If you compute the number of heterozygous genotypes on the X chromosome for each sample, what do you see?
Sorry still very new to hail, what’s the filter expression for het and x and count per sample? Thanks.
if you’re new to hail you should totally switch to 0.2. Then it will be something like:
import hail as hl
mt = hl.import_vcf('...')
mt = mt.annotate_cols(
n_het_per_sample = hl.agg.count_where(mt.locus.in_x_nonpar() & mt.GT.is_het()))
het_stats = mt.n_het_per_sample.collect()
Well I couldn’t control the version being installed on our company computing environment and currently it is 0.1… Would the above code works for 0.1 version? Thanks.
No, the interface in 0.1 is totally different from 0.2.
I think something along the lines of the above:
vds = (vds
.filter_variants_expr('v.contig == "X"')
.annotate_samples_expr('sa.nHetX = gs.filter(g => g.isHet()).count()')
Name: sa.nHetX, dtype: float64
This looks pretty normally distributed. Where are your data coming from?
whole exome sequencing going through gatk best practice pipeline.
Do you have an idea which samples are actually male? That could help identify problem sites on the X chromosome.
Other than that, I don’t know how much we can help – this is a scientific problem, not a technical one.