I’m trying to densify a sparse MatrixTable (mt) with over 200,000 samples and 2,787 partitions. This is the code I’m trying to run:
import hail as hl

hl.init(default_reference='GRCh38', log='/densify_sparse_mt.log')
freeze = args.freeze
logger.info('Reading in input mt (raw sparse mt)')
mt = hl.read_matrix_table(args.input)
logger.info(f'Sparse mt count: {mt.count()}')
logger.info('Densifying mt')
mt = hl.experimental.densify(mt)
logger.info('Filtering out lines that are only reference')
mt = mt.filter_rows(hl.len(mt.alleles) > 1)
logger.info(f'Count after filtration: {mt.count()}')
logger.info('Writing out mt')
mt = mt.repartition(30000)
mt.write(raw_mt_path('broad', freeze), args.overwrite)
This has been running for over 17 hours, and the cluster hasn’t downscaled (autoscaling is enabled). Roughly how long does it take to downscale when using densify? Also, would it be better to write out the mt first and then repartition?
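For reference, here is roughly what I mean by that second option: write the densified, filtered mt once, then read it back and repartition from the on-disk copy before the final write. The temporary path below is just a placeholder, not something in my script:

# write the densified, filtered mt to a temporary location first
mt.write('gs://my-bucket/broad_dense_tmp.mt', overwrite=True)  # placeholder temp path

# read the written copy back and repartition from it, so the shuffle
# runs against the dense on-disk data rather than the densify pipeline
mt = hl.read_matrix_table('gs://my-bucket/broad_dense_tmp.mt')
mt = mt.repartition(30000)
mt.write(raw_mt_path('broad', freeze), args.overwrite)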