Inconsistent sample qc results

wonu · April 21, 2020, 3:12pm

Hi,

I’m running the following code on exome sequencing data and getting different results when I plot the sample call rate afterwards. Does the order of the commands make a difference, or is there something else I’m missing? Thanks!

import hail as hl

scz = hl.read_matrix_table('Analyses/scz_main.mt')

# Per-sample QC metrics, annotated before any of the filters below are applied.
scz = hl.sample_qc(scz, name='sample_qc')

scz.count_rows()

# Keep sites with at most 6 alleles (reference included), then split multiallelics.
scz = scz.filter_rows(scz.alleles.length() <= 6)

scz = hl.split_multi_hts(scz, permit_shuffle=True)

# Remove (keep=False) every defined genotype that fails any check for its class.
scz = scz.filter_entries(
    hl.is_defined(scz.GT) &
    (
        (scz.GT.is_hom_ref() &
            (
                ((scz.AD[0] / scz.DP) < 0.8) |
                (scz.GQ < 20) |
                (scz.DP < 20)
            )
        ) |
        (scz.GT.is_het() &
            (
                (((scz.AD[0] + scz.AD[1]) / scz.DP) < 0.8) |
                ((scz.AD[1] / scz.DP) < 0.2) |
                (scz.PL[0] < 20) |
                (scz.DP < 20)
            )
        ) |
        (scz.GT.is_hom_var() &
            (
                ((scz.AD[1] / scz.DP) < 0.8) |
                (scz.PL[0] < 20) |
                (scz.DP < 20)
            )
        )
    ),
    keep=False
)

scz.count_rows()

I’m using hail version 0.2.36-ed011219dd93

danking · April 21, 2020, 5:41pm

Could you try explaining the issue in a different way, perhaps with the full, exact code and output for each scenario?

It makes sense that the call rate changes after you filter some entries, as described in the code you shared. Why do you expect sample call rate would not change?

wonu · April 21, 2020, 6:08pm

Hi, that’s the exact code. I’ve had to run it multiple times, and I seem to get one of two different outputs when I plot the call rate after filtering (see attached photos).

Code for call rate:
from bokeh.io import export_png
callrate_geno = hl.plot.histogram(scz.sample_qc.call_rate, range=(0, 1), legend='Call Rate')
export_png(callrate_geno, filename='Output/callrate_geno.png')

danking · April 21, 2020, 7:06pm

Hmm. I’m sorry that’s happening!

Do you run this on a private cluster, in the cloud, or on a single machine? If you run it on the cloud, what command do you use to start the cluster?

When you run it multiple times, is each time on the same cluster/machine or on different clusters/machines?

Does this happen reliably? For example, if you run the script twice in a row like: python3 myscript.py; mv Output/callrate_geno.png Output/callrate_geno-1.png; python3 myscript.py, does it produce different results?
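One way to make the run-twice check concrete is to compare per-sample numbers rather than two PNGs. The sketch below is plain Python, not Hail code. It assumes each run has written a tab-separated file with a header row and two columns, a sample ID and a numeric call rate; the column names `s` and `call_rate`, the file layout, and the helper names are illustrative assumptions, not part of the script above.

```python
import csv
import math


def load_call_rates(path):
    """Read a TSV with header columns 's' and 'call_rate' into a dict.

    Assumes every call_rate value parses as a float (no missing values).
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["s"]: float(row["call_rate"]) for row in reader}


def diff_call_rates(path_a, path_b, tol=1e-9):
    """Return the samples whose call rate differs between two runs.

    The result maps sample ID -> (rate_in_a, rate_in_b). A sample that is
    absent from one file is reported with None on that side.
    """
    a = load_call_rates(path_a)
    b = load_call_rates(path_b)
    differences = {}
    for sample in sorted(set(a) | set(b)):
        rate_a, rate_b = a.get(sample), b.get(sample)
        if rate_a is None or rate_b is None:
            differences[sample] = (rate_a, rate_b)
        elif not math.isclose(rate_a, rate_b, rel_tol=0.0, abs_tol=tol):
            differences[sample] = (rate_a, rate_b)
    return differences
```

An empty result for two runs against the same input would be consistent with the pipeline being deterministic; a non-empty one gives a concrete list of samples to look at.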

Is it possible for you to provide us with a script and an example dataset that produces the two different plots?

The code you’ve shared is deterministic; it should not produce different results unless Analyses/scz_main.mt changes.

Did you change Hail versions between runs? It’s unlikely but possible that there was a bug in one of those versions. If you have an example that differs between two versions of Hail, we could investigate that further.

tpoterba · April 22, 2020, 11:42am

Are these from running sample_qc before and after the pipeline in your original post above? I’d expect the call rate to go down after your filter_entries, since sample_qc call rate is defined as the number of defined calls per sample divided by the total number of variants.
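To illustrate why filtering entries lowers the call rate, here is a toy calculation in plain Python (not Hail code). It assumes the definition given above, defined calls divided by the total number of variants, and represents a missing or filtered genotype as `None`. The genotype strings, depths, and the single depth threshold are made up for the example.

```python
def call_rate(genotypes):
    """Fraction of variants with a defined call for one sample."""
    if not genotypes:
        raise ValueError("need at least one variant")
    called = sum(1 for gt in genotypes if gt is not None)
    return called / len(genotypes)


# One sample, five variants: (genotype, depth). None means no call.
sample = [("0/0", 35), ("0/1", 12), (None, 0), ("1/1", 40), ("0/1", 25)]

before = call_rate([gt for gt, _ in sample])

# Mimic an entry filter: calls with depth below 20 become missing.
# The variants themselves stay, so the denominator does not change.
filtered = [gt if gt is not None and dp >= 20 else None for gt, dp in sample]
after = call_rate(filtered)

print(before, after)  # 0.8 0.6
```

Because the filtered entries still count in the denominator, a call rate computed after an entry filter can only stay the same or drop relative to one computed before it.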
