Densify sparse mt

I’m trying to densify a sparse mt with over 200,000 samples and 2787 partitions. This is the code I’m running:

    import hail as hl

    # `args`, `logger`, and `raw_mt_path` are defined elsewhere in this script.
    hl.init(default_reference='GRCh38', log='/densify_sparse_mt.log')
    freeze = args.freeze

    logger.info('Reading in input mt (raw sparse mt)')
    mt = hl.read_matrix_table(args.input)
    logger.info(f'Sparse mt count: {mt.count()}')

    logger.info('Densifying mt')
    mt = hl.experimental.densify(mt)

    logger.info('Filtering out rows that are only reference')
    mt = mt.filter_rows(hl.len(mt.alleles) > 1)
    logger.info(f'Count after filtering: {mt.count()}')

    logger.info('Writing out mt')
    mt = mt.repartition(30000)
    mt.write(raw_mt_path('broad', freeze), args.overwrite)

This has been running for over 17 hours, and the cluster hasn’t downscaled (it has autoscaling enabled). About how long does it take to downscale when using densify? Also, would it be better to write out the mt first and then repartition?

densify_sparse_mt.log (3.7 MB)

Oops, thought I had responded!

I definitely wouldn’t repartition this using mt.repartition. It’s possible to repartition on read instead, by passing a set of intervals to read_matrix_table.

Thanks! How do I repartition while reading? I’ve never used the _intervals parameter.

It’s not super easy; it involves some of the combiner functions. Give me until the end of today to try to clean it up? I also found a possible bug…

Er, never mind, I was just confusing myself. This should work for now:

    def rep_on_read(path, n_desired):
        # Compute interval boundaries for ~n_desired partitions,
        # then re-read the matrix table split along those intervals.
        mt = hl.read_matrix_table(path)
        intervals = mt._calculate_new_partitions(n_desired)
        return hl.read_matrix_table(path, _intervals=intervals)
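
For example, a sketch of how this could slot into your pipeline (reusing args, freeze, and raw_mt_path from your script, and keeping your 30,000-partition target), replacing the mt.repartition call:

    # Repartition the sparse mt on read instead of calling mt.repartition
    # after densify; densify and filter then run over the new partitioning.
    mt = rep_on_read(args.input, 30000)
    mt = hl.experimental.densify(mt)
    mt = mt.filter_rows(hl.len(mt.alleles) > 1)
    mt.write(raw_mt_path('broad', freeze), args.overwrite)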

awesome, thank you!!