Genotype matrix in Hail 0.2

Dear Hail team,

Given a genomic region, is it possible to output a genotype matrix (M individuals by N SNPs, where each value is a minor allele count of 0, 1, or 2) with Hail 0.2?

Thank you very much!
Best regards,
Wei

Hi Wei,
Try the following:

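import hail as hl

# Load the VCF as a MatrixTable (rows are variants, columns are samples).
mt = hl.import_vcf('src/test/resources/sample.vcf')

# Keep only the variants that fall in the region of interest.
mt = hl.filter_intervals(mt, [hl.parse_locus_interval('20:10620000-10650000')])

# Replace each genotype call with its number of alternate alleles.
mt = mt.select_entries(GT = mt.GT.n_alt_alleles())

# Export one row per variant and one column per sample.
mt.make_table(separator='').export('/tmp/foo.txt')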

You can change the interval(s) and files as necessary.
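
For reference, the per-entry value that `n_alt_alleles()` produces can be sketched in plain Python. The helper below, `count_alt_alleles`, is a hypothetical illustration that works on VCF-style genotype strings; it is not part of Hail. Note that it counts alternate alleles, which equals the minor allele count only when the alternate allele is the less common one in your samples.

```python
from typing import Optional


def count_alt_alleles(gt: str) -> Optional[int]:
    """Count non-reference alleles in a VCF-style GT string such as '0/1'.

    Hypothetical helper for illustration; returns None for missing calls.
    """
    alleles = gt.replace('|', '/').split('/')
    if any(a == '.' for a in alleles):
        return None
    return sum(1 for a in alleles if int(a) != 0)


# Rows are variants and columns are samples, as in the exported file.
calls = [
    ['0/0', '0/1', '1/1'],
    ['0|1', './.', '0/0'],
]
matrix = [[count_alt_alleles(gt) for gt in row] for row in calls]
print(matrix)  # prints [[0, 1, 2], [1, None, 0]]
```

The exported file is therefore N variants by M samples; transpose it downstream if you need individuals as rows.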

Hi Tim,

Thank you so much! That is exactly what I need.

Best regards,
Wei

Hi @tpoterba,

I tried your suggestion to export a table containing ‘samples’ and ‘counts’, but I’m getting ‘RuntimeException: Class file too large!’.

Here is what I did:

# aggregate by AF bins and consequence type
mt_grouped = (mt
              .group_rows_by(mt.af_bins, mt.csq_group)
              .aggregate_entries(n_hets=agg.count_where(mt.GT.is_het()))
              .result()
              )

# export table
tb = (mt_grouped
      .make_table(separator='')
      .export('samples_counts.txt')
      )

The grouped matrix has 10k samples, six consequence groups, and 10 AF bins, so I expect a file with 600,000 rows (that’s small).

Any ideas? Do you have another suggestion for getting a table with four columns (e.g. sample_id, af_bin, csq_group, and counts (entries))?

Best,

E.

The code you posted above will export a table with 10,000 fields (one column per sample), and this is where Hail is having trouble.

If you instead run

# entries() gives one row per (row group, sample) pair; export() writes the file and returns nothing.
(mt_grouped
 .entries()
 .export('samples_counts.txt'))

I expect things should work, and give you the 4-column file you want.
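
To illustrate the difference between the two layouts, here is a small plain-Python sketch. The group labels, sample IDs, and counts are made up for illustration, and the code does not use Hail. It only shows that the wide layout gains one column per sample, while the long layout always has four columns.

```python
from itertools import product

# Made-up labels for illustration only.
af_bins = ['<1%', '1-5%']
csq_groups = ['lof', 'missense', 'synonymous']
samples = [f'S{i}' for i in range(5)]

# Hypothetical counts keyed by (af_bin, csq_group, sample).
n_hets = {key: 0 for key in product(af_bins, csq_groups, samples)}
n_hets[('<1%', 'lof', 'S3')] = 2

# Wide layout (like make_table): one row per group, one column per sample.
wide = [
    [af, csq] + [n_hets[(af, csq, s)] for s in samples]
    for af, csq in product(af_bins, csq_groups)
]

# Long layout (like entries()): one row per (group, sample) pair, four columns.
long_rows = [
    (af, csq, s, n_hets[(af, csq, s)])
    for af, csq, s in product(af_bins, csq_groups, samples)
]

print(len(wide), len(wide[0]))            # rows, columns in the wide layout
print(len(long_rows), len(long_rows[0]))  # rows, columns in the long layout
```

With 10,000 samples and 60 groups, the wide layout would have only 60 rows but roughly 10,000 columns, while the long layout has the 600,000 rows and four columns described above.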


Hi Tim,

It works perfectly! 😉

Thanks
