Inconsistent sample qc results

Hi,

I’m running the following code on exome sequencing data and finding that I get different results when I plot my sample call rate afterwards. I was wondering if the order of commands makes a difference, or if there is something else I am missing. Thanks!

scz = hl.read_matrix_table('Analyses/scz_main.mt')

scz = hl.sample_qc(scz, name='sample_qc')

scz.count_rows()

scz = scz.filter_rows(scz.alleles.length() <= 6)

scz = hl.split_multi_hts(scz, permit_shuffle=True)

scz = scz.filter_entries(
    hl.is_defined(scz.GT) &
    (
        (scz.GT.is_hom_ref() &
            (
                ((scz.AD[0] / scz.DP) < 0.8) |
                (scz.GQ < 20) |
                (scz.DP < 20)
            )
        ) |
        (scz.GT.is_het() &
            (
                (((scz.AD[0] + scz.AD[1]) / scz.DP) < 0.8) |
                ((scz.AD[1] / scz.DP) < 0.2) |
                (scz.PL[0] < 20) |
                (scz.DP < 20)
            )
        ) |
        (scz.GT.is_hom_var() &
            (
                ((scz.AD[1] / scz.DP) < 0.8) |
                (scz.PL[0] < 20) |
                (scz.DP < 20)
            )
        )
    ),
    keep=False
)

scz.count_rows()

I’m using Hail version 0.2.36-ed011219dd93

Could you try explaining the issue in a different way, perhaps with the full, exact code and output for each scenario?

It makes sense that the call rate changes after you filter some entries, as described in the code you shared. Why do you expect sample call rate would not change?

Hi, that’s the exact code, but I’ve had to run it multiple times and seem to get one of two different outputs when I plot the call rate after filtering (see attached photos)


Code for call rate:
from bokeh.io import export_png

callrate_geno = hl.plot.histogram(scz.sample_qc.call_rate, range=(0, 1), legend='Call Rate')
export_png(callrate_geno, filename='Output/callrate_geno.png')

Hmm. I’m sorry that’s happening!

Do you run this on a private cluster, in the cloud, or on a single machine? If you run it on the cloud, what command do you use to start the cluster?

When you run it multiple times, is each time on the same cluster/machine or on different clusters/machines?

Does this happen reliably? For example, if you run the script twice in a row like: python3 myscript.py; mv Output/callrate_geno.png Output/callrate_geno-1.png; python3 myscript.py, does it produce different results?

Is it possible for you to provide us with a script and an example dataset that produces the two different plots?

The code you’ve shared is deterministic; it should not produce different results unless Analyses/scz_main.mt changes.

Did you change Hail versions between runs? It’s unlikely but possible that there was a bug in one of those versions. If you have an example that differs between two versions of Hail, we could investigate that further.
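If it helps to narrow this down, comparing the per-sample numbers from two runs is usually more informative than comparing two plots by eye. Below is a minimal sketch in plain Python (not Hail). It assumes you have already exported the call rates from each run into a dict keyed by sample ID; the function name diff_call_rates, the sample IDs, and the values are all hypothetical.

```python
import math


def diff_call_rates(run_a, run_b, tol=1e-9):
    """Return samples whose call rate differs between two runs.

    run_a and run_b map sample ID -> call rate. A sample that is
    present in only one run is also reported, with None for the
    missing side.
    """
    differing = {}
    for sample in sorted(set(run_a) | set(run_b)):
        a = run_a.get(sample)
        b = run_b.get(sample)
        if a is None or b is None:
            differing[sample] = (a, b)
        elif not math.isclose(a, b, rel_tol=0.0, abs_tol=tol):
            differing[sample] = (a, b)
    return differing


# Hypothetical values standing in for two runs of the same script.
run_1 = {"S1": 0.98, "S2": 0.95, "S3": 0.91}
run_2 = {"S1": 0.98, "S2": 0.80, "S3": 0.91}

mismatches = diff_call_rates(run_1, run_2)
print(mismatches)
```

An empty result would suggest the two runs agree and the difference is in the plotting step; a non-empty one tells us which samples to look at.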

Are these from running sample_qc before and after the pipeline in your original post above? I’d expect the call rate to go down after your filter_entries, since sample_qc call rate is defined as the number of defined calls per sample divided by the total number of variants.
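To make that definition concrete, here is a toy illustration in plain Python rather than Hail. It assumes a missing or filtered genotype is represented as None; the helper call_rate and the genotype lists are made up for the example and are not taken from your data.

```python
def call_rate(calls):
    """Fraction of a sample's genotype entries that are defined."""
    if not calls:
        return float("nan")
    n_called = sum(1 for c in calls if c is not None)
    return n_called / len(calls)


# Hypothetical sample across 5 variants, with one genotype already missing.
before = ["0/0", "0/1", None, "1/1", "0/0"]

# Suppose an entry filter then removes two low-quality calls
# (the second and fifth entries), leaving them undefined.
after = ["0/0", None, None, "1/1", None]

rate_before = call_rate(before)  # 4 defined out of 5
rate_after = call_rate(after)    # 2 defined out of 5
print(rate_before, rate_after)
```

The number of variants stays the same while the number of defined calls drops, so the call rate can only go down if it is recomputed after entries have been filtered.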