Impute sex resulting in majority imputesex.isFemale 'none'

trptyrphe · May 30, 2018, 6:52pm

Hi,

command: vds.impute_sex(maf_threshold=0.05, include_par=False, female_threshold=0.2, male_threshold=0.8, pop_freq=None)

The result is that imputesex.isFemale is none for the majority of samples and true/false for the rest, which does not match the sample labeling. What could cause this? Thanks.

tpoterba · May 30, 2018, 6:54pm

None indicates that the inbreeding coefficient lies somewhere between 0.2 and 0.8. It might be a good idea to plot a histogram of the F statistic and see what that looks like. This could indicate poor data quality, or might be expected if you have very few X chromosome sites.
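To make the thresholding concrete, here is a minimal pure-Python sketch of the rule described above, plus a crude binning helper for eyeballing the F distribution. The function names and the example F values are hypothetical and only for illustration; this is not Hail's implementation, and the defaults simply mirror the 0.2 / 0.8 thresholds from the command in the question.

```python
def classify_sex(f_stat, female_threshold=0.2, male_threshold=0.8):
    """Return True (female), False (male), or None (ambiguous) for one F value."""
    if f_stat < female_threshold:
        return True
    if f_stat > male_threshold:
        return False
    # F falls between the two thresholds, so no call is made.
    return None


def bin_counts(values, lo=0.0, hi=1.0, n_bins=5):
    """Count values per equal-width bin; out-of-range values go to the edge bins."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(n_bins - 1, idx))
        counts[idx] += 1
    return counts


# Hypothetical F values: one clear female, one clear male, three ambiguous.
f_values = [0.05, 0.45, 0.5, 0.55, 0.95]
calls = [classify_sex(f) for f in f_values]
histogram = bin_counts(f_values)
```

If most of the mass sits in the middle bins, as in this made-up example, most samples end up with None.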

tpoterba · May 30, 2018, 6:54pm

Also, if your dataset isn’t GRCh37, that could explain it – 0.1 is built for that reference genome and won’t process others correctly. 0.2 (https://www.hail.is/docs/devel) solves that problem.

trptyrphe · May 30, 2018, 7:22pm

Are hg19, hs37d5, or b37 considered GRCh37 under your definition?

tpoterba · May 30, 2018, 7:25pm

Hail 0.1 works well with contigs “1”, “2”, …“X”, “Y”.

If the contigs are named “chr1”, “chr2”, … “chrX”, bad things happen!
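A quick way to check which naming scheme a dataset uses is to look for the "chr" prefix in its contig names. The sketch below is plain Python with a hypothetical helper name; it assumes you have already pulled the contig names out of your VCF header or variant list by some other means.

```python
def chr_prefixed(contigs):
    """Return the contig names that use the 'chr' prefix (e.g. 'chrX')."""
    return [c for c in contigs if c.startswith("chr")]


# Hypothetical contig lists for illustration.
b37_style = ["1", "2", "X", "Y"]
hg19_style = ["chr1", "chr2", "chrX", "chrY"]

b37_hits = chr_prefixed(b37_style)    # nothing flagged
hg19_hits = chr_prefixed(hg19_style)  # every contig flagged
```

An empty result suggests the "1", "2", … "X", "Y" naming that 0.1 expects.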

tpoterba · May 30, 2018, 7:25pm

I’d really recommend switching to 0.2 if possible! In 0.2 the reference genome is parameterized.

trptyrphe · May 30, 2018, 7:53pm

I see, then that’s not an issue for this dataset. I do see that the majority of F scores are around 0.5. Does that mean impute sex didn’t work for my dataset?

tpoterba · May 30, 2018, 7:56pm

F around 0.5 means that the sample has 50% of the expected number of heterozygous sites. This does indicate a problem either with the data or the way it’s represented.
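The arithmetic behind that statement can be sketched in plain Python. This assumes the textbook form F = 1 - (observed hets / expected hets), with expected hets summed as 2p(1-p) over the sites where the sample has a call. The function names and inputs are hypothetical, and Hail's own computation may differ in details such as which sites and frequencies it uses.

```python
def expected_het(alt_freqs):
    """Expected number of het calls under Hardy-Weinberg: sum of 2p(1-p)."""
    return sum(2.0 * p * (1.0 - p) for p in alt_freqs)


def f_statistic(is_het, alt_freqs):
    """Inbreeding coefficient for one sample.

    is_het: one bool per called site (True if the genotype is heterozygous).
    alt_freqs: alternate allele frequency at each of those sites.
    Returns None when no heterozygosity is expected at all.
    """
    if len(is_het) != len(alt_freqs):
        raise ValueError("is_het and alt_freqs must have the same length")
    expected = expected_het(alt_freqs)
    if expected == 0:
        return None
    observed = sum(is_het)
    return 1.0 - observed / expected


# Hypothetical sample: 8 sites at frequency 0.5, so 4 hets are expected.
freqs = [0.5] * 8
half_expected = f_statistic([True, True] + [False] * 6, freqs)  # 2 of 4 expected
no_hets = f_statistic([False] * 8, freqs)                       # male-like
as_expected = f_statistic([True] * 4 + [False] * 4, freqs)      # female-like
```

A sample with no hets on X gets F = 1, one with the expected number gets F = 0, and one with half the expected number lands at 0.5.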

trptyrphe · May 30, 2018, 8:04pm

So what are the potential causes (or how do I troubleshoot in such case)?

tpoterba · May 30, 2018, 8:06pm

If you compute the number of heterozygous genotypes on the X chromosome for each sample, what do you see?

trptyrphe · May 30, 2018, 8:12pm

Sorry, I’m still very new to Hail. What’s the expression to filter to het genotypes on X and count them per sample? Thanks.

tpoterba · May 30, 2018, 8:15pm

If you’re new to Hail you should totally switch to 0.2. Then it will be something like:

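import hail as hl
hl.init()
mt = hl.import_vcf('...')
# Per sample, count het calls on the non-pseudoautosomal part of X.
mt = mt.annotate_cols(
    n_het_per_sample = hl.agg.count_where(mt.locus.in_x_nonpar() & mt.GT.is_het()))

# Collect the per-sample counts into a local Python list.
het_stats = mt.n_het_per_sample.collect()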

trptyrphe · May 30, 2018, 8:22pm

Well, I can’t control the version installed in our company computing environment, and currently it is 0.1… Would the above code work in version 0.1? Thanks.

tpoterba · May 30, 2018, 8:24pm

No, the interface in 0.1 is totally different from 0.2.

I think something along these lines:

vds = (vds
    .filter_variants_expr('v.contig == "X"')
    .annotate_samples_expr('sa.nHetX = gs.filter(g => g.isHet()).count()'))

trptyrphe · May 30, 2018, 8:34pm

Out[89]:
count    1499.000000
mean      930.446298
std       342.690231
min       262.000000
25%       742.000000
50%       789.000000
75%       873.500000
max      2373.000000
Name: sa.nHetX, dtype: float64

tpoterba · May 30, 2018, 8:45pm

This looks pretty normally distributed. Where are your data coming from?

trptyrphe · May 30, 2018, 8:47pm

Whole exome sequencing, processed through the GATK Best Practices pipeline.

tpoterba · May 30, 2018, 8:49pm

Do you have an idea which samples are actually male? That could help identify problem sites on the X chromosome.
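One way to use known males, sketched in plain Python: males carry a single X outside the pseudoautosomal regions, so a non-PAR X site where many known males are called het is suspect. Everything here is a hypothetical illustration (the function names, the data layout, and the 0.1 cutoff are assumptions, not something from Hail), and it assumes you have already exported per-site het calls for each sample.

```python
def male_het_fraction(site_calls, known_males):
    """Fraction of known males called het at one site, or None if none are called.

    site_calls: dict mapping sample ID -> True if that sample is het at the site.
    """
    male_calls = [is_het for sample, is_het in site_calls.items()
                  if sample in known_males]
    if not male_calls:
        return None
    return sum(male_calls) / len(male_calls)


def flag_suspect_sites(calls_by_site, known_males, max_male_het=0.1):
    """Return sites where the male het fraction exceeds max_male_het."""
    flagged = []
    for site, site_calls in calls_by_site.items():
        frac = male_het_fraction(site_calls, known_males)
        if frac is not None and frac > max_male_het:
            flagged.append(site)
    return flagged


# Hypothetical data: two labeled males (m1, m2) and one female (f1).
calls_by_site = {
    "X:1000": {"m1": True, "m2": True, "f1": True},    # males het: suspicious
    "X:2000": {"m1": False, "m2": False, "f1": True},  # only the female is het
}
suspect = flag_suspect_sites(calls_by_site, known_males={"m1", "m2"})
```

Dropping the flagged sites and recomputing F would show whether a handful of problem sites is driving the ambiguous calls.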

tpoterba · May 30, 2018, 8:50pm

Other than that, I don’t know how much we can help – this is a scientific problem, not a technical one.
