Merging MatrixTables raised a strange row type error

I created two MatrixTables (sparse matrices: old and new) and I want to merge them. I typically use the experimental function combine_gvcfs, but I have now hit a strange case that results in an error:

TypeError: All input tables to multi_way_zip_join must have the same row type

I went to the code of multi_way_zip_join to identify why this error is raised (and why I’ve never seen it before), and I tried to reproduce the check.

  • Test one:
old.row.dtype == new.row.dtype, \
old.globals.dtype == new.globals.dtype, \
old.row_key.dtype == new.row_key.dtype, \
old.col_key.dtype == new.col_key.dtype

Returned:

(True, True, True, True)
  • Test two:
tables = [ old, new ]
head = tables[ 0 ]
any(head.row.dtype != t.row.dtype for t in tables)

Returned:

False

I also went through this but I don’t think it is the right thread, right?

Any idea?


  • Running on Apache Spark version 3.1.1
  • HAIL version 0.2.65

Can you paste the stack trace to this error? I agree, this seems strange.

Hi!

Of course! Here it is:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-8fbce3dd5a10> in <module>
      1 from hail.experimental.vcf_combiner.vcf_combiner import combine_gvcfs
      2 
----> 3 comb = combine_gvcfs([ old, new ])

/usr/local/lib/python3.9/site-packages/decorator.py in fun(*args, **kw)
    230             if not kwsyntax:
    231                 args, kw = fix(args, kw, sig)
--> 232             return caller(func, *(extras + args), **kw)
    233     fun.__name__ = func.__name__
    234     fun.__doc__ = func.__doc__

/usr/local/lib/python3.9/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/usr/local/lib/python3.9/site-packages/hail/experimental/vcf_combiner/vcf_combiner.py in combine_gvcfs(mts)
    283     module provides no method of repartitioning data.
    284     """
--> 285     ts = hl.Table.multi_way_zip_join([localize(mt) for mt in mts], 'data', 'g')
    286     combined = combine(ts)
    287     return unlocalize(combined)

/usr/local/lib/python3.9/site-packages/decorator.py in fun(*args, **kw)
    230             if not kwsyntax:
    231                 args, kw = fix(args, kw, sig)
--> 232             return caller(func, *(extras + args), **kw)
    233     fun.__name__ = func.__name__
    234     fun.__doc__ = func.__doc__

/usr/local/lib/python3.9/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/usr/local/lib/python3.9/site-packages/hail/table.py in multi_way_zip_join(tables, data_field_name, global_field_name)
   3487             raise TypeError('All input tables to multi_way_zip_join must have the same key type')
   3488         if any(head.row.dtype != t.row.dtype for t in tables):
-> 3489             raise TypeError('All input tables to multi_way_zip_join must have the same row type')
   3490         if any(head.globals.dtype != t.globals.dtype for t in tables):
   3491             raise TypeError('All input tables to multi_way_zip_join must have the same global type')

TypeError: All input tables to multi_way_zip_join must have the same row type

We should fix the error message here to provide printouts of the types. For now, let’s try monkey patching:

def monkeypatched_zip_join(tables, data_field_name, global_field_name):
        head = tables[0]
    
        if any(head.row.dtype != t.row.dtype for t in tables):
            raise TypeError(f'All input tables to multi_way_zip_join must have the same row type\n  ' + '\n  '.join(str(t.row.dtype) for t in tables))
        return hl.Table.multi_way_zip_join(tables, data_field_name, global_field_name)

hl.Table.multi_way_zip_join = monkeypatched_zip_join

Try that above your combine_gvcfs call, it should give us more info.

Okay…

I think this goes in the same direction as my “test one”, which seemed to pass. Here is the output:

TypeError: All input tables to multi_way_zip_join must have the same row type
  struct{locus: locus<GRCh37>, alleles: array<str>, rsid: str, __entries: array<struct{DP: int32, END: int32, GQ: int32, LA: array<int32>, LAD: array<int32>, LGT: call, LPGT: call, LPL: array<int32>, MIN_DP: int32, PID: str, RGQ: int32, SB: array<int32>, gvcf_info: struct{BaseQRankSum: float64, DB: bool, ExcessHet: float64, InbreedingCoeff: float64, MLEAC: array<int32>, MLEAF: array<float64>, MQRankSum: float64, RAW_MQandDP: array<int32>, ReadPosRankSum: float64, MQ_DP: int32, VarDP: int32, QUALapprox: int32}}>}
  struct{locus: locus<GRCh37>, alleles: array<str>, rsid: str, __entries: array<struct{LA: array<int32>, LGT: call, LAD: array<int32>, LPGT: call, LPL: array<int32>, RGQ: int32, END: int32, gvcf_info: struct{BaseQRankSum: float64, DB: bool, ExcessHet: float64, InbreedingCoeff: float64, MLEAC: array<int32>, MLEAF: array<float64>, MQRankSum: float64, RAW_MQandDP: array<int32>, ReadPosRankSum: float64, MQ_DP: int32, VarDP: int32, QUALapprox: int32}, DP: int32, GQ: int32, MIN_DP: int32, PID: str, SB: array<int32>}>}

Should I put the contents of gvcf_info in the same order? For reference:

  • old structure:
old.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
----------------------------------------
Entry fields:
    'DP': int32
    'END': int32
    'GQ': int32
    'LA': array<int32>
    'LAD': array<int32>
    'LGT': call
    'LPGT': call
    'LPL': array<int32>
    'MIN_DP': int32
    'PID': str
    'RGQ': int32
    'SB': array<int32>
    'gvcf_info': struct {
        BaseQRankSum: float64, 
        DB: bool, 
        ExcessHet: float64, 
        InbreedingCoeff: float64, 
        MLEAC: array<int32>, 
        MLEAF: array<float64>, 
        MQRankSum: float64, 
        RAW_MQandDP: array<int32>, 
        ReadPosRankSum: float64, 
        MQ_DP: int32, 
        VarDP: int32, 
        QUALapprox: int32
    }
----------------------------------------
Column key: ['s']
Row key: ['locus']
----------------------------------------
  • new structure:
new.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
----------------------------------------
Entry fields:
    'LA': array<int32>
    'LGT': call
    'LAD': array<int32>
    'LPGT': call
    'LPL': array<int32>
    'RGQ': int32
    'END': int32
    'gvcf_info': struct {
        BaseQRankSum: float64, 
        DB: bool, 
        ExcessHet: float64, 
        InbreedingCoeff: float64, 
        MLEAC: array<int32>, 
        MLEAF: array<float64>, 
        MQRankSum: float64, 
        RAW_MQandDP: array<int32>, 
        ReadPosRankSum: float64, 
        MQ_DP: int32, 
        VarDP: int32, 
        QUALapprox: int32
    }
    'DP': int32
    'GQ': int32
    'MIN_DP': int32
    'PID': str
    'SB': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus']
----------------------------------------

Ah, the issue is out-of-order entry fields (these methods assume the same order).

You can unify, I think, by doing:

new2 = new.select_entries(*old.entry)

which is shorthand for

new2 = new.select_entries('DP', 'END', 'GQ', ...)
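
To see why the two row types compare unequal even though they hold the same fields, here is a minimal pure-Python sketch that does not use Hail at all. It models each entry schema as an ordered list of (name, type) pairs copied from the describe() output above, with gvcf_info abbreviated because its inner order already matches. The helper names same_fields and reorder_like are hypothetical and exist only for this illustration; the assumption is that struct type equality is order-sensitive, as the error suggests.

```python
# Entry fields in the order shown by old.describe() and new.describe().
# gvcf_info is abbreviated to "struct" because its inner field order is identical.
OLD_ENTRY = [
    ("DP", "int32"), ("END", "int32"), ("GQ", "int32"),
    ("LA", "array<int32>"), ("LAD", "array<int32>"), ("LGT", "call"),
    ("LPGT", "call"), ("LPL", "array<int32>"), ("MIN_DP", "int32"),
    ("PID", "str"), ("RGQ", "int32"), ("SB", "array<int32>"),
    ("gvcf_info", "struct"),
]
NEW_ENTRY = [
    ("LA", "array<int32>"), ("LGT", "call"), ("LAD", "array<int32>"),
    ("LPGT", "call"), ("LPL", "array<int32>"), ("RGQ", "int32"),
    ("END", "int32"), ("gvcf_info", "struct"), ("DP", "int32"),
    ("GQ", "int32"), ("MIN_DP", "int32"), ("PID", "str"),
    ("SB", "array<int32>"),
]


def same_fields(a, b):
    """True when both schemas hold the same (name, type) pairs, ignoring order."""
    return sorted(a) == sorted(b)


def reorder_like(schema, template):
    """Return `schema` with its fields arranged in the order used by `template`."""
    types = dict(schema)
    return [(name, types[name]) for name, _ in template]


# Comparing the ordered lists mimics an order-sensitive struct comparison.
order_sensitive_equal = OLD_ENTRY == NEW_ENTRY
order_insensitive_equal = same_fields(OLD_ENTRY, NEW_ENTRY)

# Reordering NEW_ENTRY plays the role of new.select_entries(*old.entry).
fixed = reorder_like(NEW_ENTRY, OLD_ENTRY)
```

Here order_sensitive_equal is False while order_insensitive_equal is True, and fixed matches OLD_ENTRY, which is the effect the select_entries call is meant to have on the real MatrixTable.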

So… Yes, it seems to solve the main issue, although it raised a new one related to the recursion depth. Here is the stack trace:

---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
<ipython-input-39-f4a43b7d5364> in <module>
     12 new3 = new.select_entries(*old.entry)
     13 
---> 14 comb = combine_gvcfs([ old, new2 ])

/usr/local/lib/python3.9/site-packages/decorator.py in fun(*args, **kw)
    230             if not kwsyntax:
    231                 args, kw = fix(args, kw, sig)
--> 232             return caller(func, *(extras + args), **kw)
    233     fun.__name__ = func.__name__
    234     fun.__doc__ = func.__doc__

/usr/local/lib/python3.9/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/usr/local/lib/python3.9/site-packages/hail/experimental/vcf_combiner/vcf_combiner.py in combine_gvcfs(mts)
    283     module provides no method of repartitioning data.
    284     """
--> 285     ts = hl.Table.multi_way_zip_join([localize(mt) for mt in mts], 'data', 'g')
    286     combined = combine(ts)
    287     return unlocalize(combined)

<ipython-input-39-f4a43b7d5364> in monkeypatched_zip_join(tables, data_field_name, global_field_name)
      6         if any(head.row.dtype != t.row.dtype for t in tables):
      7             raise TypeError(f'All input tables to multi_way_zip_join must have the same row type\n  ' + '\n  '.join(str(t.row.dtype) for t in tables))
----> 8         return hl.Table.multi_way_zip_join(tables, data_field_name, global_field_name)
      9 
     10 hl.Table.multi_way_zip_join = monkeypatched_zip_join

<ipython-input-39-f4a43b7d5364> in monkeypatched_zip_join(tables, data_field_name, global_field_name)
      6         if any(head.row.dtype != t.row.dtype for t in tables):
      7             raise TypeError(f'All input tables to multi_way_zip_join must have the same row type\n  ' + '\n  '.join(str(t.row.dtype) for t in tables))
----> 8         return hl.Table.multi_way_zip_join(tables, data_field_name, global_field_name)
      9 
     10 hl.Table.multi_way_zip_join = monkeypatched_zip_join

<ipython-input-39-f4a43b7d5364> in monkeypatched_zip_join(tables, data_field_name, global_field_name)
      6         if any(head.row.dtype != t.row.dtype for t in tables):
      7             raise TypeError(f'All input tables to multi_way_zip_join must have the same row type\n  ' + '\n  '.join(str(t.row.dtype) for t in tables))
----> 8         return hl.Table.multi_way_zip_join(tables, data_field_name, global_field_name)
      9 
     10 hl.Table.multi_way_zip_join = monkeypatched_zip_join

<ipython-input-39-f4a43b7d5364> in monkeypatched_zip_join(tables, data_field_name, global_field_name)
      6         if any(head.row.dtype != t.row.dtype for t in tables):
      7             raise TypeError(f'All input tables to multi_way_zip_join must have the same row type\n  ' + '\n  '.join(str(t.row.dtype) for t in tables))
----> 8         return hl.Table.multi_way_zip_join(tables, data_field_name, global_field_name)
      9 
     10 hl.Table.multi_way_zip_join = monkeypatched_zip_join

<ipython-input-39-f4a43b7d5364> in monkeypatched_zip_join(tables, data_field_name, global_field_name)
      6         if any(head.row.dtype != t.row.dtype for t in tables):
      7             raise TypeError(f'All input tables to multi_way_zip_join must have the same row type\n  ' + '\n  '.join(str(t.row.dtype) for t in tables))
----> 8         return hl.Table.multi_way_zip_join(tables, data_field_name, global_field_name)
      9 
     10 hl.Table.multi_way_zip_join = monkeypatched_zip_join

... last 5 frames repeated, from the frame below ...

<ipython-input-39-f4a43b7d5364> in monkeypatched_zip_join(tables, data_field_name, global_field_name)
      6         if any(head.row.dtype != t.row.dtype for t in tables):
      7             raise TypeError(f'All input tables to multi_way_zip_join must have the same row type\n  ' + '\n  '.join(str(t.row.dtype) for t in tables))
----> 8         return hl.Table.multi_way_zip_join(tables, data_field_name, global_field_name)
      9 
     10 hl.Table.multi_way_zip_join = monkeypatched_zip_join

RecursionError: maximum recursion depth exceeded while calling a Python object

Oh, oops, I messed up the monkey patch. It should have looked something like:

import hail as hl

# Save the original before patching so the wrapper does not call itself.
old_zip_join = hl.Table.multi_way_zip_join

def monkeypatched_zip_join(tables, data_field_name, global_field_name):
    head = tables[0]

    # Same check as the original, but the message lists each table's row type.
    if any(head.row.dtype != t.row.dtype for t in tables):
        raise TypeError('All input tables to multi_way_zip_join must have the same row type\n  ' + '\n  '.join(str(t.row.dtype) for t in tables))
    return old_zip_join(tables, data_field_name, global_field_name)

hl.Table.multi_way_zip_join = monkeypatched_zip_join

You can take all that out, though, if the types are now working.
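
For anyone who lands here because of the RecursionError, the difference between the two patches can be reproduced without Hail. The sketch below uses a toy Table class and hypothetical names (broken_patch, working_patch); it is only meant to show why the wrapper has to call a saved reference to the original function rather than look the attribute up again.

```python
class Table:
    """Toy stand-in for a class whose function we want to wrap."""

    @staticmethod
    def join(tables):
        return f"joined {len(tables)} tables"


# Save the original first; the working patch depends on this reference.
original_join = Table.join


def broken_patch(tables):
    # Table.join is looked up at call time, and by then it is broken_patch
    # itself, so this call never reaches the original function.
    return Table.join(tables)


def working_patch(tables):
    # Extra validation added by the patch.
    if not tables:
        raise TypeError("expected at least one table")
    # Delegate to the saved original instead of the patched attribute.
    return original_join(tables)


# Installing the broken patch recurses until Python gives up.
Table.join = broken_patch
try:
    Table.join(["old", "new"])
    recursed = False
except RecursionError:
    recursed = True

# Installing the working patch delegates to the original as intended.
Table.join = working_patch
result = Table.join(["old", "new"])
```

With the broken patch installed the call ends in a RecursionError, which matches the repeated monkeypatched_zip_join frames in the trace above; with the working patch the call goes through to the original function.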

Perfect! The issue was the order of the fields.

Thanks!