Unable to run sample/variant QC after combining MatrixTables

I am trying to run a GWAS with Hail and am running into problems when running the analysis by chromosome. I have already converted each chromosome's VCF file into a corresponding MatrixTable. When I load a single chromosome with the following code, everything works fine:

# Load a single chromosome MatrixTable from the workspace bucket
path_to_matrix_table = f"{bucket}/data/chr_1.mt"
mt = hl.read_matrix_table(path_to_matrix_table)
initial_num_samples = mt.count_cols()
initial_num_variants = mt.count_rows()

# Load the phenotype data from the workspace bucket and key it by sample ID
path_to_phenotype = f"{bucket}/data/phenotypes.tsv"
table = hl.import_table(path_to_phenotype, types={'year_birth': hl.tint32, 'sex_binary': hl.tint32}, delimiter='\t').key_by("sample")

# Combine the phenotype data into the MatrixTable
mt = mt.annotate_cols(pheno = table[mt.s])

# Call built-in QC methods
mt = hl.sample_qc(mt)
mt = hl.variant_qc(mt)

# Filter samples for missingness
mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.97)
print('After filter, %d/%d samples remain.' % (mt.count_cols(), initial_num_samples))

# Keep variants with alternate allele frequency above 0.03 and HWE p-value above 1e-6
mt = mt.filter_rows((mt.variant_qc.AF[1] > 0.03) & (mt.variant_qc.p_value_hwe > 1e-6))
print('After filter, %d/%d variants remain.' % (mt.count_rows(), initial_num_variants))

# Perform GWAS using logistic (Wald) regression on the binary sex phenotype, with only an intercept covariate
gwas = hl.logistic_regression_rows(test='wald',
                                   y=mt.pheno.sex_binary,
                                   x=mt.GT.n_alt_alleles(),
                                   covariates=[1.0])
gwas.row.describe()

However, when I instead try to create a combined MatrixTable via:

# Load and combine multiple MatrixTables from the workspace bucket

# Create an empty list in which the individual MatrixTables for each chromosome will be stored
chr_mt_list = []

# Retrieve the MatrixTables for the specified range of chromosomes (chr1 and chr2) and append them to the list
for i in range(1, 3):
    curr_path = f"{bucket}/data/chr_{i}.mt"
    curr_chr_mt = hl.read_matrix_table(curr_path)
    chr_mt_list.append(curr_chr_mt)

# Combine all MatrixTables into one MatrixTable and print the number of samples (cols) and variants (rows)
mt = hl.MatrixTable.union_rows(*chr_mt_list)
initial_num_variants = mt.count_rows()
initial_num_samples = mt.count_cols()
print('Samples: %d  Variants: %d' % (mt.count_cols(), mt.count_rows()))

and then try to run the QC steps (e.g. mt = mt.filter_rows((mt.variant_qc.AF[1] > 0.03) & (mt.variant_qc.p_value_hwe > 1e-6))), the code gets stuck. The only message displayed is [Stage 2:> (0 + 1) / 1], and it never updates.

What should I do?

This message means that Hail is running a Spark job with only one task (no parallelism). Can you add the following:

for i in range(1, 3):
    curr_path = f"{bucket}/data/chr_{i}.mt"
    curr_chr_mt = hl.read_matrix_table(curr_path)
    print(f'mt {i} has {curr_chr_mt.n_partitions()} partitions')
    chr_mt_list.append(curr_chr_mt)

mt = hl.MatrixTable.union_rows(*chr_mt_list)
print(f'union mt {i} has {mt.n_partitions()} partitions')

I tried adding the print statements you recommended and got the following output:

mt 1 has 1024 partitions
mt 2 has 1024 partitions
union mt 2 has 2048 partitions

So the union does combine the partitions, but Spark is not parallelizing across them once I try to run the QC? Or is something else going on?

After doing a bit more digging, I also got an error when trying to get the number of partitions after adding in the phenotype data. Perhaps Hail is having trouble determining the number of partitions after the columns are annotated with the phenotype information?

Code below:

# Load and combine multiple MatrixTables from the workspace bucket

# Create an empty list in which the individual MatrixTables for each chromosome will be stored
chr_mt_list = []

# Retrieve MatrixTables for specified range of chromosomes and append them to list
for i in range(1, 3):
    curr_path = f"{bucket}/data/chr_{i}.mt"
    curr_chr_mt = hl.read_matrix_table(curr_path)
    print(f'mt {i} has {curr_chr_mt.n_partitions()} partitions')
    chr_mt_list.append(curr_chr_mt)

# Combine all MatrixTables into one MatrixTable and print out number of samples (cols) and number of variants (rows)
mt = hl.MatrixTable.union_rows(*chr_mt_list)
print(f'union mt {i} has {mt.n_partitions()} partitions')

# Load phenotype data from workspace bucket
path_to_phenotype = f"{bucket}/data/phenotypes.tsv"
table = (hl.import_table(path_to_phenotype, types={'year_birth': hl.tint32, 'sex_binary': hl.tint32}, delimiter='\t')).key_by("sample")
table.describe()

mt = mt.annotate_cols(pheno = table[mt.s])
mt.describe()
print(mt.n_partitions())

Error Message:

---------------------------------------------------------------------------
HailUserError                             Traceback (most recent call last)
/tmp/ipykernel_1017/3289034321.py in <module>
      2 mt = mt.annotate_cols(pheno = table[mt.s])
      3 mt.describe()
----> 4 print(mt.n_partitions())

/opt/conda/lib/python3.7/site-packages/hail/matrixtable.py in n_partitions(self)
   3324             Number of partitions.
   3325         """
-> 3326         return Env.backend().execute(ir.MatrixToValueApply(self._mir, {'name': 'NPartitionsMatrixTable'}))
   3327 
   3328     @typecheck_method(n_partitions=int,

/opt/conda/lib/python3.7/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
    103             return (value, timings) if timed else value
    104         except FatalError as e:
--> 105             raise e.maybe_user_error(ir) from None
    106 
    107     async def _async_execute(self, ir, timed=False):

HailUserError: Error summary: HailException: cannot parse int32 from input string '.'
------------
Hail stack trace:
  File "/tmp/ipykernel_1017/3655693752.py", line 3, in <module>
    table = (hl.import_table(path_to_phenotype, types={'year_birth': hl.tint32, 'sex_binary': hl.tint32}, delimiter='\t')).key_by("sample")

  File "/opt/conda/lib/python3.7/site-packages/hail/methods/impex.py", line 1722, in import_table
    fields_to_value[field] = parse_type(ht.split_text[f_idx], t)

  File "/opt/conda/lib/python3.7/site-packages/hail/ir/utils.py", line 133, in parse_type
    return hl.int32(string_expr)

  File "/opt/conda/lib/python3.7/site-packages/hail/expr/functions.py", line 5316, in int32
    return x._method("toInt32", tint32)

  File "/opt/conda/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py", line 695, in _method
    x = ir.Apply(name, ret_type, self._ir, *(a._ir for a in args))

  File "/opt/conda/lib/python3.7/site-packages/hail/ir/ir.py", line 2740, in __init__
    self.save_error_info()

You need to specify import_table(..., missing='.') to tell Hail that . means NA. By default, Hail expects missing values to be represented with the string NA.
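
For example, the corrected call (the same import as before, just with the missing argument added) would look something like this:

table = hl.import_table(
    path_to_phenotype,
    types={'year_birth': hl.tint32, 'sex_binary': hl.tint32},
    delimiter='\t',
    missing='.'  # '.' in the TSV is now treated as a missing value instead of being parsed as int32
).key_by("sample")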

Adding missing='.' fixed the issue with retrieving the number of partitions, but I am still getting the original problem of Hail getting stuck when performing the QC operations. I decided to compute the number of partitions after calling the sample_qc and variant_qc functions, and it got stuck there as well.

Code below:

# Load and combine multiple MatrixTables from the workspace bucket

# Create an empty list in which the individual MatrixTables for each chromosome will be stored
chr_mt_list = []

# Retrieve MatrixTables for specified range of chromosomes and append them to list
for i in range(1, 3):
    curr_path = f"{bucket}/data/chr_{i}.mt"
    curr_chr_mt = hl.read_matrix_table(curr_path)
    print(f'mt {i} has {curr_chr_mt.n_partitions()} partitions')
    chr_mt_list.append(curr_chr_mt)

# Combine all MatrixTables into one MatrixTable and print out number of samples (cols) and number of variants (rows)
mt = hl.MatrixTable.union_rows(*chr_mt_list)
print(f'union mt {i} has {mt.n_partitions()} partitions') # This prints out 1024, 1024, and then 2048

# Load phenotype data from workspace bucket
path_to_phenotype = f"{bucket}/data/phenotypes.tsv"
table = (hl.import_table(path_to_phenotype, types={'year_birth': hl.tint32, 'sex_binary': hl.tint32}, delimiter='\t', missing='.')).key_by("sample")

# Combine the phenotype data into the Matrix Table
mt = mt.annotate_cols(pheno = table[mt.s])
print(mt.n_partitions()) # This prints out 2048

# Call Hail's QC functions on the samples and variants
mt = hl.sample_qc(mt)
mt = hl.variant_qc(mt)
mt.describe()
print(mt.n_partitions()) # Here is where the code gets stuck

What gets shown after trying to get partitions: [Stage 4:> (0 + 1) / 1]

Do I have to specify something in the sample_qc and variant_qc functions?

The issue is almost certainly this line:

table = (hl.import_table(path_to_phenotype, types={'year_birth': hl.tint32, 'sex_binary': hl.tint32}, delimiter='\t', missing='.')).key_by("sample")

It’s best to write Hail expressions over multiple lines for clarity:

table = hl.import_table(
    path_to_phenotype,
    types={'year_birth': hl.tint32, 'sex_binary': hl.tint32},
    delimiter='\t',
    missing='.'
)
table = table.key_by("sample")

Notice the key_by. This requires a shuffle (aka reordering) of the data. You need to isolate shuffles from other operations with a write and a read. Try this:

table.write('gs://bucket/....ht')
table = hl.read_table('gs://bucket/....ht')
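
(Equivalently, table = table.checkpoint('gs://bucket/....ht') performs the write and the read back in a single call.)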

I incorporated your suggestions and it still gets stuck when trying to get the number of partitions after calling the QC functions.

Updated code:

# Import and initialize Hail
import hail as hl
hl.init()
from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()

# Load and combine multiple MatrixTables from the workspace bucket

# Create an empty list in which the individual MatrixTables for each chromosome will be stored
chr_mt_list = []

# Retrieve MatrixTables for specified range of chromosomes and append them to list
for i in range(1, 3):
    curr_path = f"{bucket}/data/chr_{i}.mt"
    curr_chr_mt = hl.read_matrix_table(curr_path)
    print(f'mt {i} has {curr_chr_mt.n_partitions()} partitions')
    chr_mt_list.append(curr_chr_mt)

# Combine all MatrixTables into one MatrixTable and print out number of samples (cols) and number of variants (rows)
mt = hl.MatrixTable.union_rows(*chr_mt_list)
print(f'union mt {i} has {mt.n_partitions()} partitions')

# Load phenotype data from workspace bucket
path_to_phenotype = f"{bucket}/data/phenotypes.tsv"

table = hl.import_table(
    path_to_phenotype,
    types={'year_birth': hl.tint32, 'sex_binary': hl.tint32},
    delimiter='\t',
    missing='.'
)
table = table.key_by("sample")

# Write out the phenotype Table after the key_by shuffle, then read the phenotype info back in
table.write(f"{bucket}/data/phenotype.ht")
table = hl.read_table(f"{bucket}/data/phenotype.ht")

table.describe()

# Combine the phenotype data into the Matrix Table
mt = mt.annotate_cols(pheno = table[mt.s])
mt.describe()
print(mt.n_partitions()) # prints out 2048

# Call Hail's QC functions on the samples and variants
mt = hl.sample_qc(mt)
mt = hl.variant_qc(mt)
mt.describe()
print(mt.n_partitions()) # Gets stuck here with the following log message: [Stage 6:>          (0 + 1) / 1]

I wonder if it has anything to do with the number of partitions of the phenotype Hail Table rather than of the MatrixTable. From my understanding, Table.write does not have a parameter for specifying the number of partitions, but the Table does have a repartition method.

Though I do not know how that would affect the MatrixTable after the columns are annotated with the phenotype information.
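
In case it is relevant, this is roughly what I had in mind (the partition count below is arbitrary, purely for illustration):

# Hypothetical sketch only: repartition the phenotype Table before the join
table = hl.read_table(f"{bucket}/data/phenotype.ht")
table = table.repartition(8)  # Table.repartition performs a shuffle into 8 partitions
mt = mt.annotate_cols(pheno = table[mt.s])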

This is mysterious. I wouldn’t expect the code you shared to have any issue; that’s basic Hail functionality. The partitioning of the sample metadata table should have no impact. In fact, n_partitions shouldn’t even trigger a stage. Can you share the Hail log file? You might need to upload it to gist.github.com if it’s too big.
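
If it helps, for a future run you can pin the log location when initializing Hail and then copy it into your workspace bucket to share it. A rough sketch (the paths here are just illustrative):

import hail as hl

# Write the Hail log to a known local path on the notebook VM
hl.init(log='/home/jupyter/hail-gwas.log')

# ... run the pipeline, then copy the log into the bucket so it can be shared
hl.hadoop_copy('file:///home/jupyter/hail-gwas.log', f'{bucket}/logs/hail-gwas.log')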

Sure! I just messaged you the log file I believe. Let me know if you need anything else!

Were you able to take a look at the log file? I can upload it elsewhere if that is easier. Thanks!