Hello, I am testing Hail version 0.2.61 to use in my project. I am having problems with impute_sex() function. I got 22 gVCF files produced with Illumina’s DRAGEN pipeline. Also using DRAGEN, I combined those 22 gVCFs into a single, multi-sample gVCF. I imported the multi-sample gVCF into Hail and converted to Hail’s MatrixTable format. I am using GRCh38 as the genome reference.
I ran impute_sex, but when I checked the output, it returns an empty table (see screenshot). Below is the outline of my script so far. Thanks for the help.
CALL_RATE = 0.99
MALE_THRESHOLD = 0.8
FEMALE_THRESHOLD = 0.5
sex_colnames = [“f_stat”, “is_female”]
mt = hl.read_matrix_table(out_mt_path)
samples = mt.s.show(22)
mt = hl.split_multi_hts(mt)
table = hl.import_table(ANNOTATION_PATH,
impute=True,
key=“Sample”)
print(table.count())
mt = mt.annotate_cols(pheno = table[mt.s])
mt = hl.sample_qc(mt)
mt = mt.filter_cols(mt.sample_qc.call_rate >= CALL_RATE)
(Create new table to not overwrite main data)
sex_ht = hl.impute_sex(mt.GT, female_threshold=FEMALE_THRESHOLD, male_threshold=MALE_THRESHOLD)
sex_ht = sex_ht.select(*sex_colnames)
mt = mt.annotate_cols(**sex_ht[mt.s])
mt.is_female.show()
![is_female_empty](https://canada1.discourse-cdn.com/flex036/uploads/hail/original/1X/e17e78408683bfb43366e4631fb2f25d45daf615.png)
This show() output indicates that there are no samples in your matrix table.
Can you put print(mt.count_cols())
before and after this filter?
Ok, I changed the call rate threshold to 0.97, put a new filtering step for read depth and changed the impute_sex call to include only biallelic variants
CALL_RATE = 0.97
MALE_THRESHOLD = 0.8
FEMALE_THRESHOLD = 0.5
READ_DEPTH = 20
sex_colnames = [“f_stat”, “is_female”]
mt = hl.read_matrix_table(out_mt_path)
samples = mt.s.show(22)
mt = hl.split_multi_hts(mt)
table = hl.import_table(ANNOTATION_PATH,
impute=True,
key=“Sample”)
print(table.count())
mt = mt.annotate_cols(pheno = table[mt.s])
mt = hl.sample_qc(mt)
mt = mt.filter_cols(mt.sample_qc.call_rate >= CALL_RATE)
mt = mt.filter_cols(mt.sample_qc.dp_stats.mean > READ_DEPTH)
Then, I ran the command you suggested
mt.count_cols()
It output 20 (2 samples removed)
(Create new table to not overwrite main data)
sex_ht = mt.filter_rows((hl.len(mt.alleles) == 2) & (mt.was_split == False))
sex_ht = hl.impute_sex(sex_ht.GT, female_threshold=FEMALE_THRESHOLD, male_threshold=MALE_THRESHOLD)
sex_ht = sex_ht.select(*sex_colnames)
mt = mt.annotate_cols(**sex_ht[mt.s])
mt.is_female.show()
The is_female.show() output shows the 20 samples as expected, but is_female column is all NA. Do you suspect what I am doing wrong?
![hail2](https://canada1.discourse-cdn.com/flex036/uploads/hail/original/1X/3a56bf1a2ebf866a3132cfb3b968d794055cd562.png)