Impute_sex() problems with multi-sample gVCF

antoniocampos13 · June 8, 2021, 3:38pm

Hello, I am testing Hail version 0.2.61 to use in my project. I am having problems with impute_sex() function. I got 22 gVCF files produced with Illumina’s DRAGEN pipeline. Also using DRAGEN, I combined those 22 gVCFs into a single, multi-sample gVCF. I imported the multi-sample gVCF into Hail and converted to Hail’s MatrixTable format. I am using GRCh38 as the genome reference.

I ran impute_sex, but when I checked the output, it returns an empty table (see screenshot). Below is the outline of my script so far. Thanks for the help.

CALL_RATE = 0.99
MALE_THRESHOLD = 0.8
FEMALE_THRESHOLD = 0.5
sex_colnames = [“f_stat”, “is_female”]

mt = hl.read_matrix_table(out_mt_path)

samples = mt.s.show(22)

mt = hl.split_multi_hts(mt)

table = hl.import_table(ANNOTATION_PATH,
impute=True,
key=“Sample”)

print(table.count())

mt = mt.annotate_cols(pheno = table[mt.s])

mt = hl.sample_qc(mt)

mt = mt.filter_cols(mt.sample_qc.call_rate >= CALL_RATE)

(Create new table to not overwrite main data)

sex_ht = hl.impute_sex(mt.GT, female_threshold=FEMALE_THRESHOLD, male_threshold=MALE_THRESHOLD)
sex_ht = sex_ht.select(*sex_colnames)

mt = mt.annotate_cols(**sex_ht[mt.s])

mt.is_female.show()
is_female_empty

tpoterba · June 8, 2021, 4:08pm

This show() output indicates that there are no samples in your matrix table.

Can you put print(mt.count_cols()) before and after this filter?

antoniocampos13 · June 9, 2021, 4:10pm

Ok, I changed the call rate threshold to 0.97, put a new filtering step for read depth and changed the impute_sex call to include only biallelic variants

CALL_RATE = 0.97
MALE_THRESHOLD = 0.8
FEMALE_THRESHOLD = 0.5
READ_DEPTH = 20
sex_colnames = [“f_stat”, “is_female”]

mt = hl.read_matrix_table(out_mt_path)

samples = mt.s.show(22)

mt = hl.split_multi_hts(mt)

table = hl.import_table(ANNOTATION_PATH,
impute=True,
key=“Sample”)

print(table.count())

mt = mt.annotate_cols(pheno = table[mt.s])

mt = hl.sample_qc(mt)

mt = mt.filter_cols(mt.sample_qc.call_rate >= CALL_RATE)

mt = mt.filter_cols(mt.sample_qc.dp_stats.mean > READ_DEPTH)

Then, I ran the command you suggested

mt.count_cols()

It output 20 (2 samples removed)

(Create new table to not overwrite main data)

sex_ht = mt.filter_rows((hl.len(mt.alleles) == 2) & (mt.was_split == False))
sex_ht = hl.impute_sex(sex_ht.GT, female_threshold=FEMALE_THRESHOLD, male_threshold=MALE_THRESHOLD)
sex_ht = sex_ht.select(*sex_colnames)

mt = mt.annotate_cols(**sex_ht[mt.s])

mt.is_female.show()

The is_female.show() output shows the 20 samples as expected, but is_female column is all NA. Do you suspect what I am doing wrong?
hail2

Topic		Replies	Views
How to use impute_sex() on a run_combiner() sparse MatrixTable? Hail Query & hailctl	5	390	May 21, 2025
Trying to recode output from imputed_sex to 0 or 1 Hail Query & hailctl	3	76	May 31, 2024
Non diploid and multiallelic genotypes issues Hail Query & hailctl	0	283	June 23, 2023
Typo in impute_sex() description Meta	2	689	April 24, 2019
Running impute_sex Hail Query & hailctl	2	352	November 2, 2021

Impute_sex() problems with multi-sample gVCF

Related topics