Help finding rare variants across 100 patients

Hello team,

Can you please help me find an efficient way to identify shared rare variants among 100 patients? I have 100 gVCF files (one per patient). My current workflow is below:

  1. Import the gVCF file into a MatrixTable:
     mt1 = hl.import_vcf('xxx.g.vcf.gz', force_bgz=True, reference_genome="GRCh38")
  2. Filter to gene regions defined by intervals:
     fmt = hl.filter_intervals(mt1, intervals)
  3. Annotate with gnomAD to get allele frequency data:
     db = hl.experimental.DB(region='us', cloud='aws')
     ann_fmt = db.annotate_rows_db(fmt, 'gnomad_genome_sites')
  4. Filter further by allele frequency:
     rare_mt = ann_fmt.filter_rows(ann_fmt.gnomad_genome_sites.freq[0].AF < 0.001)
  5. Extract the locus and alleles key from rare_mt.
  6. Repeat steps 1-5 for every patient and find the intersection of the rare variants (see the sketch after this list).
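
Roughly, this is what I have in mind for steps 5 and 6 (just a sketch; per_patient_tables below is a hypothetical list holding one key-only table per patient):

# step 5: keep only the (locus, alleles) key of one patient's rare variants
rare_variants = rare_mt.rows().select()

# step 6 (sketch): intersect the per-patient key tables with inner joins;
# per_patient_tables is a hypothetical list with one such table per patient
shared = per_patient_tables[0]
for tbl in per_patient_tables[1:]:
    shared = shared.join(tbl, how="inner")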

Thanks for your help with steps 3 and 4 above in the other topic that I posted a couple of weeks ago. I was able to run steps 1-5 for one patient, but it took a long time (about 30-40 minutes on a Spark cluster with 8 nodes). Instead of writing a loop over steps 1-5 for each gVCF file (each patient), is there a more efficient way to accomplish this? Your help is highly appreciated.

I use import_gvcfs (Hail | Import / Export), which can take multiple gVCF files and a partitions argument to restrict to specific regions while loading the data.

mts = hl.import_gvcfs(
    vcf_files,
    partitions=regions,
    reference_genome="GRCh38",
    array_elements_required=False,
)

# union the per-sample MatrixTables into a single MatrixTable before further work
if len(mts) == 1:
    mt = mts[0]
else:
    mt = hl.MatrixTable.union_rows(*mts)

I think checkpointing after this makes subsequent operations faster, but someone can correct me on this.
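
Something like this, for example (just a sketch; the output path is only a placeholder for wherever you keep intermediate files):

# write the unioned MatrixTable once so that downstream filtering and
# annotation read the imported data instead of re-parsing all the gVCFs
# (the path below is a placeholder)
mt = mt.checkpoint("gs://your-bucket/tmp/combined_gvcfs.mt", overwrite=True)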

Thanks very much for the advice!
I tried the import_gvcfs function:

mts = hl.import_gvcfs(
    paths,
    partitions=intervals,
    reference_genome="GRCh38",
    array_elements_required=False,
)

I got the error below:
FatalError: HailException: range bounds must be inclusive

I checked the intervals:
[Interval(start=Locus(contig=chr1, position=18182239, reference_genome=GRCh38), end=Locus(contig=chr1, position=21299915, reference_genome=GRCh38), includes_start=True, includes_end=False),
Interval(start=Locus(contig=chr1, position=25196014, reference_genome=GRCh38), end=Locus(contig=chr1, position=28282103, reference_genome=GRCh38), includes_start=True, includes_end=False),
…]

I can see that for each interval the includes_end=False. Is there some function or method available to change this to True?

I created the intervals using the code below:
intervals = []
for i in range(0, 88):
    intervals.append(hl.parse_locus_interval(GenePosition.iloc[i, 7], reference_genome='GRCh38'))

where an example of GenePosition.iloc[1,7] is chr16:28162962-31170875.
Thanks.

The start locus must precede the end locus. The default bounds of the interval are left-inclusive and right-exclusive. To change this, add one of [ or ( at the beginning of the string for left-inclusive or left-exclusive respectively. Likewise, add one of ] or ) at the end of the string for right-inclusive or right-exclusive respectively.

https://hail.is/docs/0.2/functions/genetics.html#hail.expr.functions.parse_locus_interval
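
For example, something like this should give intervals with inclusive end points (a sketch, assuming your GenePosition.iloc[i, 7] strings look like chr16:28162962-31170875, as in your code above):

# wrapping each region string in [ ... ] makes both the start and the end inclusive
intervals = [
    hl.parse_locus_interval("[" + GenePosition.iloc[i, 7] + "]", reference_genome="GRCh38")
    for i in range(88)
]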

Thanks for the tip on fixing the intervals. I fixed my intervals and re-ran the import_gvcfs function, but got another error this time:

FatalError                                Traceback (most recent call last)
in
      3     partitions=intervals,
      4     reference_genome="GRCh38",
----> 5     array_elements_required=False,
      6 )

in import_gvcfs(path, partitions, force, force_bgz, call_fields, entry_float_type, reference_genome, contig_recoding, array_elements_required, skip_invalid_loci, filter, find_replace, _external_sample_ids, _external_header)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)
    578
    579 return wrapper

/opt/conda/lib/python3.6/site-packages/hail/methods/impex.py in import_gvcfs(path, partitions, force, force_bgz, call_fields, entry_float_type, reference_genome, contig_recoding, array_elements_required, skip_invalid_loci, filter, find_replace, _external_sample_ids, _external_header)
   2447     find_replace[1] if find_replace is not None else None,
   2448     _external_sample_ids,
--> 2449     _external_header)
   2450 vector_ref = json.loads(vector_ref_s)
   2451 jir_vref = ir.JIRVectorReference(vector_ref['vector_ir_id'],

/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255 answer = self.gateway_client.send_command(command)
   1256 return_value = get_return_value(
--> 1257     answer, self.gateway_client, self.target_id, self.name)
   1258
   1259 for temp_arg in temp_args:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     29     raise FatalError('%s\n\nJava stack trace:\n%s\n'
     30                      'Hail version: %s\n'
---> 31                      'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     32 except pyspark.sql.utils.CapturedException as e:
     33     raise FatalError('%s\n\nJava stack trace:\n%s\n'
FatalError: IllegalArgumentException: requirement failed

Java stack trace:
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:212)
at is.hail.rvd.RVDPartitioner.<init>(RVDPartitioner.scala:52)
at is.hail.rvd.RVDPartitioner.<init>(RVDPartitioner.scala:40)
at is.hail.io.vcf.VCFsReader.<init>(LoadVCF.scala:1888)
at is.hail.backend.spark.SparkBackend$$anonfun$pyImportVCFs$1$$anonfun$apply$11.apply(SparkBackend.scala:536)
at is.hail.backend.spark.SparkBackend$$anonfun$pyImportVCFs$1$$anonfun$apply$11.apply(SparkBackend.scala:535)
at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:638)
at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:47)
at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:46)
at is.hail.utils.package$.using(package.scala:638)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:46)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
at is.hail.backend.spark.SparkBackend$$anonfun$pyImportVCFs$1.apply(SparkBackend.scala:535)
at is.hail.backend.spark.SparkBackend$$anonfun$pyImportVCFs$1.apply(SparkBackend.scala:534)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
at is.hail.backend.spark.SparkBackend.pyImportVCFs(SparkBackend.scala:534)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)

Hail version: 0.2.78-b17627756568
Error summary: IllegalArgumentException: requirement failed

I don’t know what this error means or how to fix it. Your help is highly appreciated!

@shengwei66 Can you update to the latest version of Hail? That version of Hail is over a year old. Unfortunately the team has limited resources and cannot support old versions of Hail. The latest version is 0.2.109.

Thanks, Dan, for your reply. I am running the analysis on the DNAnexus platform, and I am not sure whether Hail can be updated to the latest version there.

Maybe I will need to look for another approach to importing the gVCF files. How about the VDS combiner? Is it available in Hail 0.2.78? Thanks.

The VDS Combiner uses the same subsystem.

I’m quite confident you’re running into a known bug from late 2021 / early 2022 which was fixed on February 22, 2022 in PR 11355. There’s no way to work around this; versions of Hail before 0.2.85 simply had a bug in hl.import_gvcfs.

I think the most valuable thing you can do is contact DNAnexus. They need to hear directly from users how important it is to keep Hail up to date. Hail is an evolving piece of software, with new features, faster implementations, and bug fixes released every week.

FWIW, two of DNANexus’ competitors, Terra and the All of Us RWB, both regularly update Hail. Those platforms should generally be no more than a month out of date.

Thanks, Dan, for your advice. I will contact DNAnexus to request an update of Hail.


Hi,
I’m getting a very similar error. I must say I just started using Hail again after a while and had to upgrade it. This is my current setup:
Name: hail
Version: 0.2.112
Name: pyspark
Version: 3.3.2

I used hailctl to launch a Google Cloud instance; after connecting to the notebook, I get this error immediately when calling hl.init().
I'm guessing I need to upgrade something else to make it compatible again? Help is appreciated, I'm stuck!

Trace

Py4JJavaError                             Traceback (most recent call last)
in <cell line: 1>()
----> 1 hl.init()

in init(sc, app_name, master, local, log, quiet, append, min_block_size, branching_factor, tmp_dir, default_reference, idempotent, global_seed, spark_conf, skip_logging_configuration, local_tmpdir, _optimizer_iterations, backend, driver_cores, driver_memory, worker_cores, worker_memory, gcs_requester_pays_configuration, regions)

/opt/conda/miniconda3/lib/python3.10/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)
    578
    579 return wrapper

/opt/conda/miniconda3/lib/python3.10/site-packages/hail/context.py in init(sc, app_name, master, local, log, quiet, append, min_block_size, branching_factor, tmp_dir, default_reference, idempotent, global_seed, spark_conf, skip_logging_configuration, local_tmpdir, _optimizer_iterations, backend, driver_cores, driver_memory, worker_cores, worker_memory, gcs_requester_pays_configuration, regions)
    346     ))
    347 if backend == 'spark':
--> 348     return init_spark(
    349         sc=sc,
    350         app_name=app_name,

in init_spark(sc, app_name, master, local, log, quiet, append, min_block_size, branching_factor, tmp_dir, default_reference, idempotent, global_seed, spark_conf, skip_logging_configuration, local_tmpdir, _optimizer_iterations, gcs_requester_pays_configuration)

/opt/conda/miniconda3/lib/python3.10/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)
    578
    579 return wrapper

/opt/conda/miniconda3/lib/python3.10/site-packages/hail/context.py in init_spark(sc, app_name, master, local, log, quiet, append, min_block_size, branching_factor, tmp_dir, default_reference, idempotent, global_seed, spark_conf, skip_logging_configuration, local_tmpdir, _optimizer_iterations, gcs_requester_pays_configuration)
    425 app_name = app_name or 'Hail'
    426 gcs_requester_pays_project, gcs_requester_pays_buckets = convert_gcs_requester_pays_configuration_to_hadoop_conf_style(gcs_requester_pays_configuration)
--> 427 backend = SparkBackend(
    428     idempotent, sc, spark_conf, app_name, master, local, log,
    429     quiet, append, min_block_size, branching_factor, tmpdir, local_tmpdir,

/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/spark_backend.py in __init__(self, idempotent, sc, spark_conf, app_name, master, local, log, quiet, append, min_block_size, branching_factor, tmpdir, local_tmpdir, skip_logging_configuration, optimizer_iterations, gcs_requester_pays_project, gcs_requester_pays_buckets)
    182     self._jbackend, log, True, append, branching_factor, skip_logging_configuration, optimizer_iterations)
    183 else:
--> 184     self._jbackend = hail_package.backend.spark.SparkBackend.apply(
    185         jsc, app_name, master, local, True, min_block_size, tmpdir, local_tmpdir,
    186         gcs_requester_pays_project, gcs_requester_pays_buckets)

/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1319
   1320 answer = self.gateway_client.send_command(command)
--> 1321 return_value = get_return_value(
   1322     answer, self.gateway_client, self.target_id, self.name)
   1323

/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324     value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling z:is.hail.backend.spark.SparkBackend.apply.
: java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:268)
at is.hail.backend.spark.SparkBackend$.apply(SparkBackend.scala:227)
at is.hail.backend.spark.SparkBackend.apply(SparkBackend.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)

I’m not sure what you’re doing exactly, but this error means Spark is already running. You probably need to do hl.init(sc=spark.sparkContext) or something like that.
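
Something along these lines (a sketch, assuming the notebook environment already provides a running SparkSession):

import hail as hl
from pyspark.sql import SparkSession

# reuse the SparkSession/SparkContext that the notebook already started
# instead of letting hl.init() try to create a new one
spark = SparkSession.builder.getOrCreate()
hl.init(sc=spark.sparkContext)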