ht.write() throws NumberFormatException after I handled missing values

Hi,

I'm trying to merge two datasets together, and I'm using hl.null('float64') to handle missing values. It looks fine, as shown below:

In [9]: combined.filter(hl.is_missing(combined.CADD16snv_PHRED)).show(3)
+---------------+----------------+--------------+-----------------+------------+---------+----------+----------------+--------------+-----------------------+
| locus         | alleles        | MetaSVM_pred | CADD16snv_PHRED | CADD_phred |     MPC |    bravo | Exonic_refGene | Func_refGene | combined_nontopmed_AC |
+---------------+----------------+--------------+-----------------+------------+---------+----------+----------------+--------------+-----------------------+
| locus<GRCh37> | array<str>     | str          |         float64 |    float64 | float64 |  float64 | str            | str          |                 int64 |
+---------------+----------------+--------------+-----------------+------------+---------+----------+----------------+--------------+-----------------------+
| 5:11628       | ["TA","T"]     | NA           |              NA |         NA |      NA | 0.00e+00 | NA             | "intergenic" |                    15 |
| 5:11628       | ["TAACC","T"]  | NA           |              NA |         NA |      NA | 0.00e+00 | NA             | "intergenic" |                     0 |
| 5:11628       | ["TAACCC","T"] | NA           |              NA |         NA |      NA | 0.00e+00 | NA             | "intergenic" |                     2 |
+---------------+----------------+--------------+-----------------+------------+---------+----------+----------------+--------------+-----------------------+

+-----------------------+----------------------------+------------+------------+---------------+----------------+-------------+----------------+
| combined_nontopmed_AN | combined_nontopmed_nhomalt | exome_rsid | exome_qual | exome_filters | genome_rsid    | genome_qual | genome_filters |
+-----------------------+----------------------------+------------+------------+---------------+----------------+-------------+----------------+
|                 int32 |                      int64 | str        |    float64 | set<str>      | str            |     float64 | set<str>       |
+-----------------------+----------------------------+------------+------------+---------------+----------------+-------------+----------------+
|                   386 |                          0 | NA         |         NA | NA            | "rs1193073303" |    1.53e+04 | {}             |
|                   386 |                          0 | NA         |         NA | NA            | "rs1193073303" |    1.53e+04 | {}             |
|                   386 |                          0 | NA         |         NA | NA            | "rs1193073303" |    1.53e+04 | {}             |
+-----------------------+----------------------------+------------+------------+---------------+----------------+-------------+----------------+

The NA values are recognized by is_missing(), but when I try to write out this Hail table, I get a 'NumberFormatException: For input string: "NA"'.

Do you have any suggestions for how to deal with this issue, or for how I should handle my missing values?

Best,

Po-Ying

Hey @poyingfu !

I’m sorry you’re having trouble with Hail! Can you share the code you executed before In [9]? I can’t figure out the issue without a bit more information. Thanks!

Hi @danking

Of course!
Thanks for the help. My steps are as follows:

In [1]: import hail as hl
   ...: hl.init()
   ...: 
   ...: # VARs:
   ...: chrom = 5
   ...: out_dir = "/.../gnomad/MetricsTable"
   ...: # Import combined gnomAD exome/genome Hail Table (hail.table.Table):
   ...: combined = hl.read_table('{}/combined.filtered.gnomad.r2.1.1.sites.{}.ht'.format(out_dir,chrom))

In [2]: # Check for missing MetaSVM, CADD, or Bravo values. If the exome value is missing, use the genome value and vice versa.
   ...: # Exomes use info; genomes use info_1.
   ...: # Joining the tables causes NA values for positions missing in either dataset.
   ...: # Genomes have some positions missing in exomes, etc.
   ...: # MetaSVM_pred:
   ...: combined = combined.annotate(MetaSVM_pred=hl.if_else(
   ...:     hl.is_missing(combined.info.MetaSVM_pred) == True,
   ...:     combined.info_1.MetaSVM_pred,
   ...:     combined.info.MetaSVM_pred))
   ...: # CADD16snv_PHRED:
   ...: combined = combined.annotate(CADD16snv_PHRED=hl.if_else(
   ...:     hl.is_missing(combined.info.CADD16snv_PHRED) == True,
   ...:     hl.if_else(hl.is_missing(combined.info_1.CADD16snv_PHRED) == True,
   ...:                hl.null('float64'),
   ...:                hl.float64(combined.info_1.CADD16snv_PHRED)),
   ...:     hl.float64(combined.info.CADD16snv_PHRED)))
   ...: # CADD_phred (CADD13):
   ...: combined = combined.annotate(CADD_phred=hl.if_else(
   ...:     hl.is_missing(combined.info.CADD_phred) == True,
   ...:     hl.if_else(hl.is_missing(combined.info_1.CADD_phred) == True,
   ...:                hl.null('float64'),
   ...:                hl.float64(combined.info_1.CADD_phred)),
   ...:     hl.float64(combined.info.CADD_phred)))
   ...: # MPC: the raw field is an array; take the first element and convert it to float64
   ...: combined = combined.annotate(info=combined.info.annotate(MPC=hl.float64(combined.info.MPC[0])))
   ...: combined = combined.annotate(info_1=combined.info_1.annotate(MPC=hl.float64(combined.info_1.MPC[0])))
   ...: combined = combined.annotate(MPC=hl.if_else(
   ...:     hl.is_missing(combined.info.MPC) == True,
   ...:     hl.if_else(hl.is_missing(combined.info_1.MPC) == True,
   ...:                hl.null('float64'),
   ...:                hl.float64(combined.info_1.MPC)),
   ...:     hl.float64(combined.info.MPC)))
   ...: # BRAVO:
   ...: combined = combined.annotate(bravo=hl.if_else(
   ...:     hl.is_missing(combined.info.bravo) == True,
   ...:     combined.info_1.bravo,
   ...:     combined.info.bravo))
   ...: 
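
As an aside, each hl.if_else(hl.is_missing(x), y, x) pattern above can be written more compactly with hl.coalesce, which returns its first non-missing argument. A minimal sketch using the same field names (this only shortens the code; it does not change how the string "NA" is treated):

combined = combined.annotate(
    MetaSVM_pred=hl.coalesce(combined.info.MetaSVM_pred,
                             combined.info_1.MetaSVM_pred))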

There are a few more steps like the ones above to process this merged data, and finally I reorganized my dataset as below:

In [5]: # Add rsid and quality filters from both genomes and exomes
   ...: combined = combined.annotate(exome_rsid = combined.rsid)
   ...: combined = combined.annotate(exome_qual = combined.qual)
   ...: combined = combined.annotate(exome_filters = combined.filters)
   ...: combined = combined.annotate(genome_rsid = combined.rsid_1)
   ...: combined = combined.annotate(genome_qual = combined.qual_1)
   ...: combined = combined.annotate(genome_filters = combined.filters_1)
   ...: 
   ...: # Drop info and info_1 fields as no longer necessary
   ...: fields_to_drop = ['info','info_1','rsid','rsid_1','qual','qual_1','filters','filters_1']
   ...: combined = combined.drop(*fields_to_drop)

Actually, I get the error at In [6]:

In [6]: # Write to specified directory
   ...: combined.write('{}/combined.filtered.formatted.gnomad.r2.1.1.sites.{}.ht'.format(out_dir,chrom),overwrite=True)
[Stage 0:============>                                           (60 + 7) / 275]
---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-6-a292bdd9877a> in <module>
      1 # Write to specified directory
----> 2 combined.write('{}/combined.filtered.formatted.gnomad.r2.1.1.sites.{}.ht'.format(out_dir,chrom),overwrite=True)

<decorator-gen-1088> in write(self, output, overwrite, stage_locally, _codec_spec)

/gpfs/ycga/project/kahle/pf374/conda_envs/hail0261_py37/lib/python3.7/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615 
    616     return wrapper

/gpfs/ycga/project/kahle/pf374/conda_envs/hail0261_py37/lib/python3.7/site-packages/hail/table.py in write(self, output, overwrite, stage_locally, _codec_spec)
   1269         """
   1270 
-> 1271         Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
   1272 
   1273     def _show(self, n, width, truncate, types):

/gpfs/ycga/project/kahle/pf374/conda_envs/hail0261_py37/lib/python3.7/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
     96                 raise HailUserError(message_and_trace) from None
     97 
---> 98             raise e

/gpfs/ycga/project/kahle/pf374/conda_envs/hail0261_py37/lib/python3.7/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
     72         # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
     73         try:
---> 74             result = json.loads(self._jhc.backend().executeJSON(jir))
     75             value = ir.typ._from_json(result['value'])
     76             timings = result['timings']

/gpfs/ycga/project/kahle/pf374/conda_envs/hail0261_py37/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/gpfs/ycga/project/kahle/pf374/conda_envs/hail0261_py37/lib/python3.7/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     30                 raise FatalError('%s\n\nJava stack trace:\n%s\n'
     31                                  'Hail version: %s\n'
---> 32                                  'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     33         except pyspark.sql.utils.CapturedException as e:
     34             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: NumberFormatException: For input string: "NA"

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 66 in stage 0.0 failed 1 times, most recent failure: Lost task 66.0 in stage 0.0 (TID 66, localhost, executor driver): java.lang.NumberFormatException: For input string: "NA"
	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
	at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
	at java.lang.Double.parseDouble(Double.java:538)
	at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:285)
	at scala.collection.immutable.StringOps.toDouble(StringOps.scala:29)
	at is.hail.expr.ir.functions.UtilFunctions$.parseFloat64(UtilFunctions.scala:48)
	at __C874Compiled.__m881toFloat64(Emit.scala)
	at __C874Compiled.applyregion15_28(Emit.scala)
	at __C874Compiled.applyregion8_386(Emit.scala)
	at __C874Compiled.apply(Emit.scala)
	at is.hail.expr.ir.TableMapRows$$anonfun$87$$anonfun$apply$5.apply$mcJJ$sp(TableIR.scala:1876)
	at is.hail.expr.ir.TableMapRows$$anonfun$87$$anonfun$apply$5.apply(TableIR.scala:1875)
	at is.hail.expr.ir.TableMapRows$$anonfun$87$$anonfun$apply$5.apply(TableIR.scala:1875)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at scala.collection.Iterator$$anon$12.next(Iterator.scala:445)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at is.hail.io.RichContextRDDRegionValue$.writeRowsPartition(RichContextRDDRegionValue.scala:36)
	at is.hail.io.RichContextRDDLong$$anonfun$writeRows$extension$1.apply(RichContextRDDRegionValue.scala:230)
	at is.hail.io.RichContextRDDLong$$anonfun$writeRows$extension$1.apply(RichContextRDDRegionValue.scala:230)
	at is.hail.utils.richUtils.RichContextRDD$.writeParts(RichContextRDD.scala:41)
	at is.hail.utils.richUtils.RichContextRDD$$anonfun$3.apply(RichContextRDD.scala:107)
	at is.hail.utils.richUtils.RichContextRDD$$anonfun$3.apply(RichContextRDD.scala:105)
	at is.hail.sparkextras.ContextRDD$$anonfun$cmapPartitionsWithIndex$1$$anonfun$apply$18.apply(ContextRDD.scala:248)
	at is.hail.sparkextras.ContextRDD$$anonfun$cmapPartitionsWithIndex$1$$anonfun$apply$18.apply(ContextRDD.scala:248)
	at is.hail.utils.richUtils.RichContextRDD$$anonfun$cleanupRegions$1$$anonfun$2.apply(RichContextRDD.scala:59)
	at is.hail.utils.richUtils.RichContextRDD$$anonfun$cleanupRegions$1$$anonfun$2.apply(RichContextRDD.scala:59)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
	at is.hail.utils.richUtils.RichContextRDD$$anonfun$cleanupRegions$1$$anon$1.hasNext(RichContextRDD.scala:68)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1334)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:166)
	at is.hail.utils.richUtils.RichContextRDD.writePartitions(RichContextRDD.scala:109)
	at is.hail.io.RichContextRDDLong$.writeRows$extension(RichContextRDDRegionValue.scala:224)
	at is.hail.rvd.RVD.write(RVD.scala:797)
	at is.hail.expr.ir.TableNativeWriter.apply(TableWriter.scala:102)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:825)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:67)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:62)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:14)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:12)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
	at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:354)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:338)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:335)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:25)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:23)
	at is.hail.utils.package$.using(package.scala:618)
	at is.hail.annotations.Region$.scoped(Region.scala:18)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:23)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:247)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:335)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:379)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:377)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:377)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

java.lang.NumberFormatException: For input string: "NA"
	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
	... (remainder identical to the executor stack trace above)

Hail version: 0.2.61-3c86d3ba497a
Error summary: NumberFormatException: For input string: "NA"

[Stage 0:============>                                           (60 + 1) / 275]
In [7]: 

The stage just hangs there. I hope this information helps. Many thanks for your time; it's really appreciated!

Best,
Po-Ying

I think I understand what's happening! You have some strings "NA" in fields like combined.info.CADD16snv_PHRED. You're checking for them with hl.is_missing, which is wrong: even though show() prints missing values as NA, that doesn't mean that hl.is_missing("NA") is true. Here the field holds a present string whose value is "NA".

Try running:

combined.filter(combined.CADD16snv_PHRED == 'NA').show(3)

You can change the check to compare against the string 'NA' instead of using hl.is_missing; I think that will fix things.
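
To see the distinction concretely, here is a minimal sketch (assuming an initialized Hail session):

import hail as hl

# A present string "NA" is not a missing value:
print(hl.eval(hl.is_missing('NA')))            # False
# An actual missing value is:
print(hl.eval(hl.is_missing(hl.null('str'))))  # True
# Converting the present string is what raises the error:
# hl.eval(hl.float64('NA'))  # NumberFormatException: For input string: "NA"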

Hi @tpoterba

Thanks for the suggestion! Here is my testing:

In [21]: combined.filter(combined.CADD16snv_PHRED == 'NA').show(3)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-3e09e5853498> in <module>
----> 1 combined.filter(combined.CADD16snv_PHRED == 'NA').show(3)

/gpfs/ycga/project/kahle/pf374/conda_envs/hail0261_py37/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in __eq__(self, other)
    684             ``True`` if the two expressions are equal.
    685         """
--> 686         return self._compare_op("==", other)
    687 
    688     def __ne__(self, other):

/gpfs/ycga/project/kahle/pf374/conda_envs/hail0261_py37/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in _compare_op(self, op, other)
    486         left, right, success = unify_exprs(self, other)
    487         if not success:
--> 488             raise TypeError(f"Invalid '{op}' comparison, cannot compare expressions "
    489                             f"of type '{self.dtype}' and '{other.dtype}'")
    490         res = left._bin_op(op, right, hl.tbool)

TypeError: Invalid '==' comparison, cannot compare expressions of type 'float64' and 'str'

Oops, sorry, I meant the one from info:

combined.info.CADD16snv_PHRED

(Also, run this show() before the annotation step that creates the floating-point fields, or it will still fail.)

No worries! Actually, you are right about combined.CADD16snv_PHRED:

In [22]: combined.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'locus': locus<GRCh37> 
    'alleles': array<str> 
    'MetaSVM_pred': str 
    'CADD16snv_PHRED': float64 
    'CADD_phred': float64 
    'MPC': float64 
    'bravo': float64 
    'Exonic_refGene': str 
    'Func_refGene': str 
    'combined_nontopmed_AC': int64 
    'combined_nontopmed_AN': int32 
    'combined_nontopmed_nhomalt': int64 
    'exome_rsid': str 
    'exome_qual': float64 
    'exome_filters': set<str> 
    'genome_rsid': str 
    'genome_qual': float64 
    'genome_filters': set<str> 
----------------------------------------
Key: ['locus', 'alleles']
----------------------------------------

I thought that getting TypeError: Invalid '==' comparison, cannot compare expressions of type 'float64' and 'str' meant the hl.null('float64') worked fine, so I cannot compare a string with a float64? (I hope I understand it correctly.)

This is all correct. The problem is in this line in your first code block:

   ...: combined = combined.annotate(CADD16snv_PHRED=hl.if_else(
   ...:     hl.is_missing(combined.info.CADD16snv_PHRED) == True,
   ...:     hl.if_else(hl.is_missing(combined.info_1.CADD16snv_PHRED) == True,
   ...:                hl.null('float64'),
   ...:                hl.float64(combined.info_1.CADD16snv_PHRED)),
   ...:     hl.float64(combined.info.CADD16snv_PHRED)))

If the value of combined.info.CADD16snv_PHRED is missing, this will return a missing float64. However, if the value is the present string 'NA', hl.is_missing(combined.info.CADD16snv_PHRED) returns False, and you will try to convert it to a float64. I think this line should be:

combined = combined.annotate(CADD16snv_PHRED=hl.if_else(
    combined.info.CADD16snv_PHRED == 'NA',
    hl.null('float64'),
    hl.float64(combined.info.CADD16snv_PHRED)))
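
Alternatively, hl.parse_float64 returns a missing value (rather than raising an error) when a string cannot be parsed, so it handles 'NA' and any other non-numeric junk in one step. A sketch, assuming the field really is a string at this point:

combined = combined.annotate(
    CADD16snv_PHRED=hl.parse_float64(combined.info.CADD16snv_PHRED))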

Got it! Thanks for the explanation. Now I understand what you were saying; that actually could cause problems. I'm working on it and will keep you updated. Thanks!

In [24]: combined.info.CADD16snv_PHRED.show(3)
+---------------+------------+---------+
| locus         | alleles    |  <expr> |
+---------------+------------+---------+
| locus<GRCh37> | array<str> | float64 |
+---------------+------------+---------+
| 5:11624       | ["A","G"]  |      NA |
| 5:11628       | ["T","C"]  |      NA |
| 5:11628       | ["T","G"]  |      NA |
+---------------+------------+---------+
showing top 3 rows

Hi @tpoterba

I got an issue here:

In [5]: combined.filter(hl.is_missing(combined.info.CADD16snv_PHRED)).info.CADD16snv_PHRED.show(3)
+---------------+------------+---------+
| locus         | alleles    |  <expr> |
+---------------+------------+---------+
| locus<GRCh37> | array<str> | float64 |
+---------------+------------+---------+
| 5:11624       | ["A","G"]  |      NA |
| 5:11628       | ["T","C"]  |      NA |
| 5:11628       | ["T","G"]  |      NA |
+---------------+------------+---------+
showing top 3 rows

In [6]: combined.filter(combined.info.CADD16snv_PHRED == 'NA').info.CADD16snv_PHRED.show(3)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-af2a766e8fe2> in <module>
----> 1 combined.filter(combined.info.CADD16snv_PHRED == 'NA').info.CADD16snv_PHRED.show(3)

/gpfs/ycga/project/kahle/pf374/conda_envs/hail0261_py37/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in __eq__(self, other)
    684             ``True`` if the two expressions are equal.
    685         """
--> 686         return self._compare_op("==", other)
    687 
    688     def __ne__(self, other):

/gpfs/ycga/project/kahle/pf374/conda_envs/hail0261_py37/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in _compare_op(self, op, other)
    486         left, right, success = unify_exprs(self, other)
    487         if not success:
--> 488             raise TypeError(f"Invalid '{op}' comparison, cannot compare expressions "
    489                             f"of type '{self.dtype}' and '{other.dtype}'")
    490         res = left._bin_op(op, right, hl.tbool)

TypeError: Invalid '==' comparison, cannot compare expressions of type 'float64' and 'str'

So, if I cannot use is_missing() to check for the NA values, I'm not sure what I can do to convert the NA to float64?

Can you post the output of:

combined = hl.read_table('{}/combined.filtered.gnomad.r2.1.1.sites.{}.ht'.format(out_dir,chrom))
combined.describe()

Of course, but the list of row fields is pretty long, so I will cut out some unused columns:

In [8]: combined = hl.read_table('{}/combined.filtered.gnomad.r2.1.1.sites.{}.ht'.format(out_dir,chrom))
   ...: combined.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'locus': locus<GRCh37> 
    'alleles': array<str> 
    'rsid': str 
    'qual': float64 
    'filters': set<str> 
    'info': struct {
        AC: array<int32>, 
        AN: int32, 
        AF: array<float64>, 
        ...
        MetaSVM_pred: str, 
        CADD_phred: float64, 
        bravo: float64, 
        MPC: array<str>, 
        CADD16snv_PHRED: float64, 
        Exonic_refGene: str, 
        Func_refGene: str
    } 
    'rsid_1': str 
    'qual_1': float64 
    'filters_1': set<str> 
    'info_1': struct {
        AC: array<int32>, 
        AN: int32, 
        AF: array<float64>, 
        ...
        MetaSVM_pred: str, 
        ...
        CADD_phred: float64, 
        bravo: float64, 
        MPC: array<str>, 
        CADD16snv_PHRED: float64, 
        Exonic_refGene: str, 
        Func_refGene: str
    } 
----------------------------------------
Key: ['locus', 'alleles']
----------------------------------------

Ah! The fields info.CADD16snv_PHRED and info.CADD_phred are already floats. I think the problem is probably MPC, then. Can you do:

combined = hl.read_table('{}/combined.filtered.gnomad.r2.1.1.sites.{}.ht'.format(out_dir,chrom))
combined.filter(combined.info.MPC[0] == 'NA').info.MPC.show()
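
If that shows rows, the literal 'NA' strings inside info.MPC are what the hl.float64 conversion in In [2] is choking on. A sketch of a guarded conversion using the thread's field names (an explicit 'NA' check; hl.parse_float64 would also work):

combined = combined.annotate(info=combined.info.annotate(
    MPC=hl.if_else(combined.info.MPC[0] == 'NA',
                   hl.null('float64'),
                   hl.float64(combined.info.MPC[0]))))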

It works!!! Thanks so much, you are awesome!!! I really appreciate your help!