Count and filter homozygous loci

Dear Hail community,

I am trying to first count the number of experiments that are homozygous reference at each locus, and then filter out the loci that are homozygous reference in all of the experiments.

So, I start by loading the GVCF and densifying it.

sparseMatrix = hl.read_matrix_table( '{0}/chrom-{1}'.format( gvcf_store_path, chrom ) )
denseMatrix = hl.experimental.densify( sparseMatrix )

So, running denseMatrix.LGT.show( 5 ) I get this:

locus	E012877.LGT	E012882.LGT
locus<GRCh37>	call	call
20:1	0/0	0/0
20:60001	0/0	0/0
20:60014	0/0	0/0
20:60019	0/0	0/0
20:60022	0/0	0/0

Then I identify the homozygous reference calls with denseMatrix.LGT.is_hom_ref().show( 5 ):

locus	E012877.	E012882.
locus<GRCh37>	bool	bool
20:1	true	true
20:60001	true	true
20:60014	true	true
20:60019	true	true
20:60022	true	true

But I do not see how to compute the per-row count of those as a row annotation:

denseMatrix.annotate_rows( nH = hl.eval( denseMatrix.LGT.is_hom_ref() ) )

Nor how to perform the filtering:

denseMatrix = denseMatrix[ denseMatrix.nH( denseMatrix.nH == denseMatrix.count_cols) ]

Any suggestion is welcome.

Thanks in advance,
~Carles

This should be:

denseMatrix = denseMatrix.annotate_rows(
  nH = hl.agg.count_where(denseMatrix.LGT.is_hom_ref()))
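In case it helps to see what that aggregation computes: inside annotate_rows, the aggregator runs across the columns (samples) of each row (locus). Here is a rough plain-Python model of that, not Hail code. The locus names, the genotypes, and the helper functions are all made up for illustration, and calls are modelled as allele tuples.

```python
# Plain-Python model (NOT Hail): one entry per locus, one call per sample.
# Locus names and genotypes are invented for illustration only.
rows = {
    "locus_a": [(0, 0), (0, 0)],  # both samples hom-ref
    "locus_b": [(0, 0), (0, 1)],  # one hom-ref, one het
    "locus_c": [(1, 1), (0, 1)],  # no hom-ref samples
}


def is_hom_ref(call):
    """True when every allele in the call is the reference allele (index 0)."""
    return call is not None and all(allele == 0 for allele in call)


def count_hom_ref(calls):
    """Count the samples at one locus whose call is homozygous reference."""
    return sum(1 for call in calls if is_hom_ref(call))


# One count per locus, analogous to the nH row annotation.
n_hom_ref = {locus: count_hom_ref(calls) for locus, calls in rows.items()}
print(n_hom_ref)
```

The point is only that the result is one number per row, which is why it belongs in annotate_rows rather than hl.eval.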

And then I can filter like this, right?

denseMatrix = denseMatrix.filter_rows( denseMatrix.nH < denseMatrix.count_cols() )

Thanks,
~Carles

Yes, that will work!

However, there’s a better way to execute that filter:

denseMatrix = denseMatrix.filter_rows( hl.agg.any(denseMatrix.LGT.is_non_ref()))
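One thing worth checking on your own data: the two filters should agree when every call is present, but they may differ on rows that contain missing calls. Below is a plain-Python sketch of the logic, not Hail code. It assumes a missing call (None here) counts as neither hom-ref nor non-ref, which is my understanding of how the aggregators treat missing values. The locus names, genotypes, and helper functions are hypothetical.

```python
# Plain-Python model (NOT Hail) comparing the two row filters.
# Assumption: a missing call (None) is neither hom-ref nor non-ref.
rows = {
    "locus_a": [(0, 0), (0, 0)],  # all hom-ref
    "locus_b": [(0, 0), (0, 1)],  # one het call
    "locus_c": [(0, 0), None],    # hom-ref plus a missing call
}


def is_hom_ref(call):
    """True when every allele is the reference allele (index 0)."""
    return call is not None and all(allele == 0 for allele in call)


def is_non_ref(call):
    """True when at least one allele is not the reference allele."""
    return call is not None and any(allele != 0 for allele in call)


def keep_by_count(calls):
    """Model of: nH < number of columns."""
    return sum(1 for call in calls if is_hom_ref(call)) < len(calls)


def keep_by_any(calls):
    """Model of: any call in the row is non-ref."""
    return any(is_non_ref(call) for call in calls)


kept_by_count = [locus for locus, calls in rows.items() if keep_by_count(calls)]
kept_by_any = [locus for locus, calls in rows.items() if keep_by_any(calls)]
print(kept_by_count)
print(kept_by_any)
```

Under that assumption, the count-based filter keeps the row with a missing call, while the any-based filter drops it because no sample there carries a non-reference allele. Which behaviour you want depends on how you intend to treat loci with missing data.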
