Linear regression hanging - help needed

Hi Dan,

Thanks for keeping the suggestions coming. There’s already a line removing rows whose gene symbol is NA; that accounts for 3,463 of the 10,441 rows (33%) in the HGDP table:

burden = burden.filter_rows(hl.is_missing(burden.gene_symbol), keep=False)

However, I moved the filtering up to the step where I import the HGDP table itself; until now it had been done after the grouping and aggregation steps. If the NA rows are going to be removed anyway, they may as well be removed before a costly group-and-aggregate step, right?
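As a rough sketch of that reordering (the path and the exact field layout below are placeholders for my actual setup, not the real pipeline):

    import hail as hl

    # Placeholder path; the real pipeline reads the HGDP data here.
    genomes = hl.read_matrix_table('hgdp.mt')

    # Drop rows with a missing gene symbol *before* the group/aggregate
    # step, so the expensive shuffle never touches them.
    genomes = genomes.filter_rows(
        hl.is_missing(genomes.vep_info.gene_symbol), keep=False)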

The print(burden.n_partitions()) line prints 1, which supports your theory that the table is collapsing from many partitions down to one. The n_partitions() call itself runs for several minutes, printing the same status line that the linear regression printed in the previous run (below):

[Stage 16:===================================================>(2585 + 1) / 2586]

Another thing I tried was adding a call to <table>.show() after every table modification, to isolate which step causes the hang. (The only catch is that this may introduce inefficiency of its own, since it forces Spark to evaluate each stage in the order written rather than giving the planner license to rearrange things.)
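Roughly what that looked like, as a sketch (field names are from my pipeline; each show() forces the preceding stage to evaluate where it appears):

    genomes = genomes.filter_rows(
        hl.is_missing(genomes.vep_info.gene_symbol), keep=False)
    genomes.rows().show(5)   # forces this stage to run before moving on

    burden = genomes.group_rows_by(genomes.vep_info.gene_symbol).aggregate(
        n_variants=hl.agg.count_where(genomes.GT.n_alt_alleles() > 0))
    burden.entries().show(5)  # this is where the hang shows up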

Anyway: The step where it is hanging seems to be:

burden = (
    genomes
    .group_rows_by(genomes.vep_info.gene_symbol)
    .aggregate(n_variants=hl.agg.count_where(genomes.GT.n_alt_alleles() > 0))
)

Do you think it would help if I broke this multistep operation into several discrete steps?
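For concreteness, the split I have in mind would look something like this (same logic as above, just with named intermediates, plus an optional checkpoint to materialize the intermediate result; the path is a placeholder):

    # The grouped-aggregation above, split into discrete, inspectable steps.
    grouped = genomes.group_rows_by(genomes.vep_info.gene_symbol)
    burden = grouped.aggregate(
        n_variants=hl.agg.count_where(genomes.GT.n_alt_alleles() > 0))
    # Optional: write the result to disk and cut the lineage.
    # burden = burden.checkpoint('burden.mt', overwrite=True)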

Thanks,
Daniel