I’m trying to densify a sparse MatrixTable (MT) with over 200,000 samples and 2,787 partitions. This is the code I’m trying to run:
import logging
import hail as hl

logger = logging.getLogger('densify_sparse_mt')
hl.init(default_reference='GRCh38', log='/densify_sparse_mt.log')

# args is parsed with argparse elsewhere in the script; raw_mt_path is a path helper from our codebase
freeze = args.freeze

logger.info('Reading in input MT (raw sparse MT)')
mt = hl.read_matrix_table(args.input)
logger.info(f'Sparse MT count: {mt.count()}')

logger.info('Densifying MT')
mt = hl.experimental.densify(mt)

logger.info('Filtering out rows that are reference-only')
mt = mt.filter_rows(hl.len(mt.alleles) > 1)
logger.info(f'Count after filtering: {mt.count()}')

logger.info('Writing out MT')
mt = mt.repartition(30000)
mt.write(raw_mt_path('broad', freeze), args.overwrite)
This has been running for over 17 hours, and the cluster hasn’t scaled down (autoscaling is enabled). Roughly how long should it take to downscale when using densify? Also, would it be better to write out the MT first and then repartition?
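For the second question, this is roughly the alternative I have in mind (the temporary path below is just a placeholder):

# Possible alternative: write the densified MT first, then re-read it and repartition before the final write.
tmp_path = 'gs://my-bucket/tmp/densified_broad.mt'  # placeholder path
mt.write(tmp_path, overwrite=True)
mt = hl.read_matrix_table(tmp_path)
mt = mt.repartition(30000)
mt.write(raw_mt_path('broad', freeze), args.overwrite)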