Filtering samples with extreme heterozygosity in hail?

Hi,

We are trying to use HAIL package to create a GWAS tutorial that is comparable to one we have made using R and Plink in hopes of being able to leverage Spark and possible improve the computational speed and/or memory consumption when conducting analyses (especially for larger datasets). Currently when attempting to perform quality control steps, we are able to get similar results when looking at variant call rate, minor allele frequency, and sample call rate outputs in both R and hail. When looking at heterozygosity to further assess sample quality and calculate a heterozygosity f statistic cut off to filter samples, we find sample qc statistics and available methods differ from R.

Our Current Approach in R:

The R snpStats package provides heterozygosity as an output which we plotted to determine what thresholds we wanted to use and manually calculated a heterozygosity f statistic/inbreeding cut off using minor allele frequency (for expected heterozygosity), and heterozygosity and N called for (observed heterozygosity). We then determined a f-statistic and calculated an f statistic cut off given the heterozygosity distribution to filter samples.

In Hail:

In hail we calculated mean heterozygosity by using the n_het/n_called columns output from the sample_qc method.

What is the best approach for filtering samples with extreme heterozygosity in hail?

Currently we are attempting a similar approach to what we did in R and calculate a heterozygosity f statistic threshold by determining the max of the absolute value of the f statistic from samples within 2-2.5 standard deviations of the mean heterozygosity. To do this we plan to use the annotate_cols and inbreeding aggregator to calculate the fstatistic and determine the thresholds using the n_het/n_called columns output from the sample qc method. Are there columns or stats included in the output from the sample qc method we should be using instead?

Can you provide the Hail Python script you’re running? I would expect this to work. I don’t have a sense of what snpStats is doing precisely, though.

An alternative to the heterozygosity f-statistic is to use the Hardy-Weinberg test p-value. This is computed by hardy_weinberg_test aggregator and included in the statistics computed by the variant_qc method.

Thanks for the reply.

Below is the code for filtering samples with extreme heterozygosity in hail as well as R. Ultimately, the fstatistic value used as a cut off for filtering the samples with hail vs. R, but are similar. Given these results we have the following questions:

  • Did we implement the filtering process correctly in hail? Is there any mistake we have made that could lead to the difference in the f-statistic cut off?
  • If the hail code is correct should we consider the difference insignificant?
  • If we were to investigate filtering samples using Hardy Weinberg as suggested by Cotton Seed below, how is the variant_qc method used to filter samples? We do plan to filter variants using Hardy Weinberg as the next QC step, but were unsure how to apply this method when filtering samples.

#calculate heterozygosity using the n_het and n_called fields from the sample_qc method and fstatistic using the inbreeding method

m1=m1.annotate_cols(heterozygosity=((m1.sample_qc.n_het/m1.sample_qc.n_called)), inbreeding = hl.agg.inbreeding(m1.GT, m1.variant_qc.AF[1]))

#given the heterozygosity distribution, determine the min and the max heterozygosity values to be used when calculating a f-statistic cut off to filter samples

het_stats=m1.aggregate_cols(hl.agg.stats(m1.heterozygosity))

het_low=(het_stats.mean)-2.3* (het_stats.stdev)

het_high=(het_stats.mean)+2.3 *(het_stats.stdev)

het_data= m1.filter_cols((m1.heterozygosity <= het_high) & (m1.heterozygosity >= het_low))

#calculate the fstatistic cut off

het_fstat_cutoff = max(map(abs, het_data.inbreeding.f_stat.collect()))

#filter samples filtered using fstatistic and sample call rate

m1_filtered_fstat_scallRate = m1.filter_cols(((m1.inbreeding.f_stat <= abs(het_fstat_cutoff_min)) &

(m1.inbreeding.f_stat >= het_fstat_cutoff_min)) &

(m1.sample_qc.call_rate > 0.95))

m1_filtered_fstat_scallRate.count_cols()

The same approach followed in R is as follows:

#heterozygosity is returned from the row.summary (snpsum.row) command from the snpStats package (we have compared distributions and they are they same)

snpsum.row ← row.summary(genotype)

callmatrix ← !is.na(genotype) #get call matrix of complete calls

#calculate fstatististic (hetf) (using expected/observed heterozygosity)*

MAF ← snpsum.col$MAF

hetExp ← callmatrix %% (2MAF*(1-MAF))

hetObs ← with(this_snpsum.row,Heterozygosity*(ncol(this_genotype))*Call.rate)

snpsum.row$hetF ← 1-(hetObs/hetExp)

#given the heterozygosity distribution, determine the min and the max heterozygosity values to be used when calculating a f-statistic cut off to filter samples

snpsum.row_low ← min(snpsum.row$Heterozygosity[snpsum.row$Heterozygosity >= het_low])

snpsum.row_up ← max(snpsum.row$Heterozygosity[snpsum.row$Heterozygosity <= het_up])

snpsum.row_up_low ← snpsum.row[snpsum.row$Heterozygosity == snpsum.row_low | snpsum.row$Heterozygosity == snpsum.row_up,]

#set heterozygosity F Statistic cut off

hetcutoff ← max(abs(snpsum.row_up_low$hetF))

The outputs from both sets of code above are as follows

Calculation Hail R
Min heterozygosity value within in 2.3 sdev of mean 0.31 0.31
Max heterozygosity value within in 2.3 sdev of mean 0.34 0.33
F statistic cut off 0.029 0.028
Samples filtered 33 (1368 remain) 37 (1364 remain)

When attempting to filter samples using Hardy Weinberg as Cotton Seed suggested, we tried the following:

m1_filtered_hwe = m1.filter_rows(m1.variant_qc.p_value_hwe > 10e-7)
m1_filtered_hwe.count_rows() #variants are filtered, yet the sample counts remain the same.

This makes sense given the method, but how should we be using it to filter samples?

Thanks for the reply.
I have mentioned the code in Hail and R. We would also want to understand few questions based on the code.

Below is the code for filtering samples with extreme heterozygosity in hail as well as R. Ultimately, the fstatistic value used as a cut off for filtering the samples with hail vs. R, but are similar. Given these results we have the following questions:

  • Did we implement the filtering process correctly in hail? Is there any mistake we have made that could lead to the difference in the f-statistic cut off?
  • If the hail code is correct should we consider the difference insignificant?
  • If we were to investigate filtering samples using Hardy Weinberg as suggested by Cotton Seed below, how is the variant_qc method used to filter samples? We do plan to filter variants using Hardy Weinberg as the next QC step, but were unsure how to apply this method when filtering samples.

#calculate heterozygosity using the n_het and n_called fields from the sample_qc method and fstatistic using the inbreeding method

m1=m1.annotate_cols(heterozygosity=((m1.sample_qc.n_het/m1.sample_qc.n_called)), inbreeding = hl.agg.inbreeding(m1.GT, m1.variant_qc.AF[1]))

#given the heterozygosity distribution, determine the min and the max heterozygosity values to be used when calculating a f-statistic cut off to filter samples

het_stats=m1.aggregate_cols(hl.agg.stats(m1.heterozygosity))

het_low=(het_stats.mean)-2.3* (het_stats.stdev)

het_high=(het_stats.mean)+2.3 *(het_stats.stdev)

het_data= m1.filter_cols((m1.heterozygosity <= het_high) & (m1.heterozygosity >= het_low))

#calculate the fstatistic cut off

het_fstat_cutoff = max(map(abs, het_data.inbreeding.f_stat.collect()))

#filter samples filtered using fstatistic and sample call rate

m1_filtered_fstat_scallRate = m1.filter_cols(((m1.inbreeding.f_stat <= abs(het_fstat_cutoff_min)) &

(m1.inbreeding.f_stat >= het_fstat_cutoff_min)) &

(m1.sample_qc.call_rate > 0.95))

m1_filtered_fstat_scallRate.count_cols()

The same approach followed in R is as follows:

#heterozygosity is returned from the row.summary (snpsum.row) command from the snpStats package (we have compared distributions and they are they same)

snpsum.row ← row.summary(genotype)

callmatrix ← !is.na(genotype) #get call matrix of complete calls

#calculate fstatististic (hetf) (using expected/observed heterozygosity)*

MAF ← snpsum.col$MAF

hetExp ← callmatrix % % (2 MAF*(1-MAF))

hetObs ← with(this_snpsum.row,Heterozygosity*(ncol(this_genotype))*Call.rate)

snpsum.row$hetF ← 1-(hetObs/hetExp)

#given the heterozygosity distribution, determine the min and the max heterozygosity values to be used when calculating a f-statistic cut off to filter samples

snpsum.row_low ← min(snpsum.row$Heterozygosity[snpsum.row$Heterozygosity >= het_low])

snpsum.row_up ← max(snpsum.row$Heterozygosity[snpsum.row$Heterozygosity <= het_up])

snpsum.row_up_low ← snpsum.row[snpsum.row$Heterozygosity == snpsum.row_low | snpsum.row$Heterozygosity == snpsum.row_up,]

#set heterozygosity F Statistic cut off

hetcutoff ← max(abs(snpsum.row_up_low$hetF))

The outputs from both sets of code above are as follows

Calculation Hail R
Min heterozygosity value within in 2.3 sdev of mean 0.31 0.31
Max heterozygosity value within in 2.3 sdev of mean 0.34 0.33
F statistic cut off 0.029 0.028
Samples filtered 33 (1368 remain) 37 (1364 remain)

When attempting to filter samples using Hardy Weinberg as Cotton Seed suggested, we tried the following:

m1_filtered_hwe = m1.filter_rows(m1.variant_qc.p_value_hwe > 10e-7)
m1_filtered_hwe.count_rows() #variants are filtered, yet the sample counts remain the same.

This makes sense given the method, but how should we be using it to filter samples?

What I think is going on is the way that the sample missingness is handled in Hail vs R.
In R, you are explicitly removing sample rows using !is.na
As for Hail, we are handling missingness much like python. When one of the column items is missing, the resulting computation will result in missing. So, in Hail, you are not filtering out as many samples as you would in R due to missingness and that is likely the reason that you are having differing outputs.

Thanks Kumar. That makes sense. But also had one more question regarding using the variant_qc.p_value_hwe present as one of the fields in the matrix table. How should we filter the samples on this field for the heterozygosity.

this is a variant field, not a sample field. Could you describe in words what you want to do to filter samples using HWE?