Densify sparse mt

I’m trying to densify a sparse mt with over 200,000 samples and 2787 partitions. This is the code I’m running:

    import hail as hl

    # `args`, `logger`, and `raw_mt_path` are defined elsewhere in this script.
    hl.init(default_reference='GRCh38', log='/densify_sparse_mt.log')
    freeze = args.freeze

    logger.info('Reading in input mt (raw sparse mt)')
    mt = hl.read_matrix_table(args.input)
    logger.info(f'Sparse mt count: {mt.count()}')

    logger.info('Densifying mt')
    mt = hl.experimental.densify(mt)

    logger.info('Filtering out rows that are only reference')
    mt = mt.filter_rows(hl.len(mt.alleles) > 1)
    logger.info(f'Count after filtering: {mt.count()}')

    logger.info('Writing out mt')
    mt = mt.repartition(30000)
    mt.write(raw_mt_path('broad', freeze), args.overwrite)

This has been running for over 17 hours, and the cluster hasn’t downscaled (it has autoscaling enabled). About how long does it take to downscale when using densify? Also, would it be better to write out the mt first and then repartition?

densify_sparse_mt.log (3.7 MB)

Oops, thought I had responded!

I definitely wouldn’t repartition this using mt.repartition. It’s possible to repartition on read instead, by passing a set of intervals to read_matrix_table.

Thanks! How do I repartition while reading? I’ve never used the _intervals parameter.

It’s not super easy; it involves some of the combiner functions. Give me until the end of today to try to clean it up? I also found a possible bug…

Er, never mind, I was just confusing myself. This should work for now:

    def rep_on_read(path, n_desired):
        # Compute interval boundaries for ~n_desired partitions,
        # then re-read the matrix table split along those intervals.
        mt = hl.read_matrix_table(path)
        intervals = mt._calculate_new_partitions(n_desired)
        return hl.read_matrix_table(path, _intervals=intervals)
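
For example, a sketch of how this could slot into your pipeline (reusing args, freeze, and raw_mt_path from your script, and keeping your 30,000-partition target), replacing the mt.repartition call:

    # Repartition the sparse mt on read instead of calling mt.repartition
    # after densify; densify and filter then run over the new partitioning.
    mt = rep_on_read(args.input, 30000)
    mt = hl.experimental.densify(mt)
    mt = mt.filter_rows(hl.len(mt.alleles) > 1)
    mt.write(raw_mt_path('broad', freeze), args.overwrite)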

awesome, thank you!!