export_vcf(): Invalid type for format field 'gvcf_info'

Hi,

After reading a matrix table built with run_combiner(),

mt = hl.read_matrix_table(matrix_table)
mt2 = hl.experimental.densify(mt)
hl.export_vcf(mt2, bucket + 'jc/interval_wes.vcf.bgz')

the following error is raised:

hail.utils.java.FatalError: HailException: Invalid type for format field 'gvcf_info'. Found 'struct{BaseQRankSum: float64, DS: bool, ExcessHet: float64, InbreedingCoeff: float64, MLEAC: array<int32>, MLEAF: array<float64>, MQRankSum: float64, RAW_MQandDP: array<int32>, ReadPosRankSum: float64}'.

This is with Hail version 0.2.49-b1db2c323727.
full log: error.log (25.9 KB)

Should gvcf_info be reformatted?
Thank you for your help with this,
Guillaume

You’re definitely hitting the rough edges in the combiner algorithm/docs, thanks for being an early user!

This error message is accurate: there isn't a way to export struct-typed fields as VCF FORMAT fields. The easiest solution is probably to drop that field before exporting (mt = mt.drop('gvcf_info')). If you want to regenerate a normal-looking info field (as a MatrixTable row field to be exported to VCF INFO fields), you can use the gnomad utilities, especially this one:

After installing the gnomad library (with --pkgs gnomad in hailctl dataproc start):

from gnomad.utils.sparse_mt import get_as_info_expr
mt = mt.annotate_rows(info = get_as_info_expr(mt)).drop('gvcf_info')
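
For completeness, the simpler "just drop it" path looks roughly like this end to end (a minimal sketch; matrix_table and bucket are the same placeholders as in your snippet):

import hail as hl

mt = hl.read_matrix_table(matrix_table)   # sparse MT produced by run_combiner()
mt = hl.experimental.densify(mt)
mt = mt.drop('gvcf_info')                 # struct fields can't be VCF FORMAT fields
hl.export_vcf(mt, bucket + 'jc/interval_wes.vcf.bgz')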

Thank you, I’ll try that.

Maybe worth mentioning: I also needed to run run_combiner() with key_by_locus_and_alleles=True to get past an earlier export_vcf() error.
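
In case it helps anyone else, the combiner call with that flag looks roughly like this (a sketch only; gvcf_paths and the output/temp paths are placeholders, and any other arguments are whatever your run needs):

import hail as hl

hl.experimental.run_combiner(
    gvcf_paths,                        # list of input gVCF paths
    out_file='gs://bucket/jc/combined.mt',
    tmp_path='gs://bucket/tmp/',
    reference_genome='GRCh38',
    key_by_locus_and_alleles=True,     # key rows by (locus, alleles) so export_vcf works downstream
)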


Almost there! Running

mt = mt.annotate_rows(info = get_as_info_expr(mt)).drop('gvcf_info')

gives:

  Traceback (most recent call last):
  File "/tmp/3d1cc5b85e184b49823d49bd542e9484/mt_to_vcf.py", line 12, in <module>
    mt = mt.annotate_rows(info = get_as_info_expr(mt)).drop('gvcf_info')
  File "/opt/conda/default/lib/python3.6/site-packages/gnomad/utils/sparse_mt.py", line 315, in get_as_info_expr
    prefix="AS_",
  File "/opt/conda/default/lib/python3.6/site-packages/gnomad/utils/sparse_mt.py", line 193, in _get_info_agg_expr
    sum_agg_fields = _agg_list_to_dict(mt, sum_agg_fields)
  File "/opt/conda/default/lib/python3.6/site-packages/gnomad/utils/sparse_mt.py", line 185, in _agg_list_to_dict
    ",".join(missing_fields)
ValueError: Could not find the following field(s)in the MT entry schema (or nested under mt.gvcf_info: QUALapprox
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [3d1cc5b85e184b49823d49bd542e9484] failed with error:
Job failed with message [ValueError: Could not find the following field(s)in the MT entry schema (or nested under mt.gvcf_info: QUALapprox]. Additional details can be found at:
https://console.cloud.google.com/dataproc/jobs/3d1cc5b85e184b49823d49bd542e9484?project=ibd-interval&region=europe-west2
gcloud dataproc jobs wait '3d1cc5b85e184b49823d49bd542e9484' --region 'europe-west2' --project 'ibd-interval'
https://console.cloud.google.com/storage/browser/dataproc-staging-europe-west2-244291339228-2kcmfoiu/google-cloud-dataproc-metainfo/330201c7-0e45-4c19-b4c6-c733981f2f2e/jobs/3d1cc5b85e184b49823d49bd542e9484/
gs://dataproc-staging-europe-west2-244291339228-2kcmfoiu/google-cloud-dataproc-metainfo/330201c7-0e45-4c19-b4c6-c733981f2f2e/jobs/3d1cc5b85e184b49823d49bd542e9484/driveroutput
Traceback (most recent call last):
  File "/lustre/scratch118/humgen/resources/conda_envs/hail_google/bin/hailctl", line 8, in <module>
    sys.exit(main())
  File "/lustre/scratch118/humgen/resources/conda_envs/hail_google/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 100, in main
    cli.main(args)
  File "/lustre/scratch118/humgen/resources/conda_envs/hail_google/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 108, in main
    jmp[args.module].main(args, pass_through_args)
  File "/lustre/scratch118/humgen/resources/conda_envs/hail_google/lib/python3.7/site-packages/hailtop/hailctl/dataproc/submit.py", line 78, in main
    check_call(cmd)
  File "/lustre/scratch118/humgen/resources/conda_envs/hail_google/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'jobs', 'submit', 'pyspark', 'mt_to_vcf.py', '--cluster=hailwes', '--files=', '--py-files=/tmp/pyscripts_mjbn_oun.zip', '--properties=spark.speculation=true']' returned non-zero exit status 1.

Ah, we’ve seen this before; I think these functions were written against a specific schema.

You can pass sum_agg_fields=(), I think.

I was able to export to VCF using

mt = mt.drop('gvcf_info')

However,

mt = mt.annotate_rows(info = get_as_info_expr(mt, sum_agg_fields=())).drop('gvcf_info')

failed with

   Traceback (most recent call last):
  File "/tmp/2308164ba42242e389dfef99b52b87c3/mt_to_vcf_info.py", line 14, in <module>
    mt = mt.annotate_rows(info = get_as_info_expr(mt, int32_sum_agg_fields=())).drop('gvcf_info')
  File "/opt/conda/default/lib/python3.6/site-packages/gnomad/utils/sparse_mt.py", line 315, in get_as_info_expr
    prefix="AS_",
  File "/opt/conda/default/lib/python3.6/site-packages/gnomad/utils/sparse_mt.py", line 193, in _get_info_agg_expr
    sum_agg_fields = _agg_list_to_dict(mt, sum_agg_fields)
  File "/opt/conda/default/lib/python3.6/site-packages/gnomad/utils/sparse_mt.py", line 185, in _agg_list_to_dict
    ",".join(missing_fields)
ValueError: Could not find the following field(s)in the MT entry schema (or nested under mt.gvcf_info: QUALapprox
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [2308164ba42242e389dfef99b52b87c3] failed with error:
Job failed with message [ValueError: Could not find the following field(s)in the MT entry schema (or nested under mt.gvcf_info: QUALapprox]. Additional details can be found at:

This is your code, right? I think that needs to be sum_agg_fields, not int32_sum_agg_fields.

Sorry, I had tried both sum_agg_fields=() and int32_sum_agg_fields=(). The former gave:

Traceback (most recent call last):
  File "/tmp/2c8e1b74932d4506b13ffc9e9d87845d/mt_to_vcf_info.py", line 14, in <module>
    mt = mt.annotate_rows(info = get_as_info_expr(mt, sum_agg_fields=())).drop('gvcf_info')
  File "/opt/conda/default/lib/python3.6/site-packages/gnomad/utils/sparse_mt.py", line 315, in get_as_info_expr
    prefix="AS_",
  File "/opt/conda/default/lib/python3.6/site-packages/gnomad/utils/sparse_mt.py", line 196, in _get_info_agg_expr
    int32_sum_agg_fields = _agg_list_to_dict(mt, int32_sum_agg_fields)
  File "/opt/conda/default/lib/python3.6/site-packages/gnomad/utils/sparse_mt.py", line 185, in _agg_list_to_dict
    ",".join(missing_fields)
ValueError: Could not find the following field(s)in the MT entry schema (or nested under mt.gvcf_info: VarDP
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [2c8e1b74932d4506b13ffc9e9d87845d] failed with error:
Job failed with message [ValueError: Could not find the following field(s)in the MT entry schema (or nested under mt.gvcf_info: VarDP]. Additional details can be found at:

Ah – might need to pass both! I think the gnomad functions are pretty specialized to the current GATK schema.

With both, it gave:

Traceback (most recent call last):
  File "/tmp/de570704c949491293e6c1a3eb61208c/mt_to_vcf_info.py", line 14, in <module>
    mt = mt.annotate_rows(info = get_as_info_expr(mt, sum_agg_fields=(), int32_sum_agg_fields=())).drop('gvcf_info')
  File "/opt/conda/default/lib/python3.6/site-packages/gnomad/utils/sparse_mt.py", line 315, in get_as_info_expr
    prefix="AS_",
  File "/opt/conda/default/lib/python3.6/site-packages/gnomad/utils/sparse_mt.py", line 214, in _get_info_agg_expr
    {f"{prefix}{k}": hl.agg.sum(expr) for k, expr in sum_agg_fields.items()}
AttributeError: 'tuple' object has no attribute 'items'

Ah, I didn’t read this code closely enough: pass either [] or {} instead of ().
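
i.e. something along these lines (a sketch based on the call above):

from gnomad.utils.sparse_mt import get_as_info_expr

# pass empty lists (or dicts) rather than empty tuples for the aggregation-field arguments
mt = mt.annotate_rows(
    info=get_as_info_expr(mt, sum_agg_fields=[], int32_sum_agg_fields=[])
).drop('gvcf_info')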

Unfortunately, with [] it then gave:

INFO (gnomad.utils.sparse_mt 235): Computing AS_MQ as sqrt(AS_RAW_MQandDP[0]/AS_RAW_MQandDP[1]). Note that AS_MQ will be set to 0 if AS_RAW_MQandDP[1] == 0.
read MT done
densify MT
densify MT done
run export to vcf.bgz
[Stage 0:====================================================>(2373 + 1) / 2374]Traceback (most recent call last):
  File "/tmp/a90813e7f245467cbfca964ae409178f/mt_to_vcf_info.py", line 24, in <module>
    hl.export_vcf(mt2, bucket + 'jc/interval_wes_withinfo.vcf.bgz')
  File "<decorator-gen-1377>", line 2, in export_vcf
  File "/opt/conda/default/lib/python3.6/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.6/site-packages/hail/methods/impex.py", line 525, in export_vcf
    Env.backend().execute(ir.MatrixWrite(dataset._mir, writer))
  File "/opt/conda/default/lib/python3.6/site-packages/hail/backend/spark_backend.py", line 296, in execute
    result = json.loads(self._jhc.backend().executeJSON(jir))
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/conda/default/lib/python3.6/site-packages/hail/backend/spark_backend.py", line 41, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: HailException: INFO field 'AS_SB_TABLE': VCF does not support type 'array<array<int32>>'.

Java stack trace:
is.hail.utils.HailException: INFO field 'AS_SB_TABLE': VCF does not support type 'array<array<int32>>'.
	at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:9)
	at is.hail.utils.package$.fatal(package.scala:77)
	at is.hail.io.vcf.ExportVCF$.infoType(ExportVCF.scala:122)
	at is.hail.io.vcf.ExportVCF$$anonfun$header$1$4.apply(ExportVCF.scala:297)
	at is.hail.io.vcf.ExportVCF$$anonfun$header$1$4.apply(ExportVCF.scala:290)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at is.hail.io.vcf.ExportVCF$.header$1(ExportVCF.scala:290)
	at is.hail.io.vcf.ExportVCF$.apply(ExportVCF.scala:455)
	at is.hail.expr.ir.MatrixVCFWriter.apply(MatrixWriter.scala:314)
	at is.hail.expr.ir.WrappedMatrixWriter.apply(MatrixWriter.scala:42)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:811)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:56)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:51)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:14)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:12)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
	at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:318)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:305)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:304)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:20)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:18)
	at is.hail.utils.package$.using(package.scala:602)
	at is.hail.annotations.Region$.scoped(Region.scala:18)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:18)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:230)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:304)
	at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:324)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.49-b1db2c323727
Error summary: HailException: INFO field 'AS_SB_TABLE': VCF does not support type 'array<array<int32>>'.
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [a90813e7f245467cbfca964ae409178f] failed with error

If that is hard to debug, I don’t want to waste your time! I got the VCF without the info field.

We only support certain types in export_vcf (not nested arrays, like the array<array<int32>> of AS_SB_TABLE here), so the solution is either to convert this field to something supported (e.g. a flat array<int32>) or to drop it.
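
For example, either of these should get AS_SB_TABLE past export_vcf (a sketch; mt2 and bucket are the names from the traceback above):

import hail as hl

# option 1: drop the unsupported nested field from the row-level info struct
mt2 = mt2.annotate_rows(info=mt2.info.drop('AS_SB_TABLE'))

# option 2: flatten array<array<int32>> to array<int32> so VCF can represent it
mt2 = mt2.annotate_rows(
    info=mt2.info.annotate(AS_SB_TABLE=hl.flatten(mt2.info.AS_SB_TABLE))
)

hl.export_vcf(mt2, bucket + 'jc/interval_wes_withinfo.vcf.bgz')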