Impute sex resulting in majority imputesex.isFemale 'none'


command: vds.impute_sex(maf_threshold=0.05, include_par=False, female_threshold=0.2, male_threshold=0.8, pop_freq=None)

resulting in imputesex.isFemale majority is none, and rest are true/false, which is not correct according to sample labeling. Any reason causes it? Thanks.

None indicates that the inbreeding coefficient lies somewhere between 0.2 and 0.8. It might be a good idea to plot a histogram of the F statistic and see what that looks like. This could indicate poor data quality, or might be expected if you have very few X chromosome sites.

Also, if your dataset isn’t GRCh37, that could explain it – 0.1 is built for that reference genome and won’t process others correctly. 0.2 ( solves that problem.

Does hg19 or hs37d5 or b37 considered as your GRCh37 definition?

Hail 0.1 works well with contigs “1”, “2”, …“X”, “Y”.

If the contigs are named “chr1”, “chr2”, … “chrX”, bad things happen!

I’d really recommending switching to 0.2 if possible! in 0.2 the reference genome is parameterized:

I see, then that’s not an issue for this dataset, I do see majority of f score is around .5. Does that mean impute sex didn’t work for my dataset?

F around 0.5 means that the sample has 50% of the expected number of heterozygous sites. This does indicate a problem either with the data or the way it’s represented.

So what are the potential causes (or how do I troubleshoot in such case)?

If you compute the number of heterozygous genotypes on the X chromosome for each sample, what do you see?

Sorry still very new to hail, what’s the filter expression for het and x and count per sample? Thanks.

if you’re new to hail you should totally switch to 0.2. Then it will be something like:

import hail as hl
mt = hl.import_vcf('...')
mt = mt.annotate_cols(
    n_het_per_sample = hl.agg.count_where( & mt.GT.is_het()))

het_stats = mt.n_het_per_sample.collect()

Well I couldn’t control the version being installed on our company computing environment and currently it is 0.1… Would the above code works for 0.1 version? Thanks.

No, the interface in 0.1 is totally different from 0.2.

I think something along the lines of the above:

vds = (vds
    .filter_variants_expr('v.contig == "X"')
    .annotate_samples_expr('sa.nHetX = gs.filter(g => g.isHet()).count()')

count 1499.000000
mean 930.446298
std 342.690231
min 262.000000
25% 742.000000
50% 789.000000
75% 873.500000
max 2373.000000
Name: sa.nHetX, dtype: float64

This looks pretty normally distributed. Where are your data coming from?

whole exome sequencing going through gatk best practice pipeline.

Do you have an idea which samples are actually male? That could help identify problem sites on the X chromosome.

Other than that, I don’t know how much we can help – this is a scientific problem, not a technical one.