Bgen and bgen index in different directories


I have a bit of an unusual situation. Because of things that are out of my control, all of my input bgen and index files are in individual directories.

I’m wondering if it’s possible for the bgen and index files to be in different directories. I thought that the index_file_map could be the option but I’m having trouble getting things to work.

import hail as hl

# Map each BGEN file to the index file that lives in a different directory.
bgen_index_map = {"data/bgen/chr22.bgen": "data/index/chr22.bgen.idx"}

mt = hl.import_bgen(path="data/bgen/chr22.bgen",
                    entry_fields=["GP"],
                    index_file_map=bgen_index_map)

> 2018-11-30 20:22:22 Hail: WARN: BGEN file `data/bgen/chr22.bgen' contains no sample ID block and no sample ID file given.
>   Using _0, _1, ..., _N as sample IDs.
> 2018-11-30 20:22:22 Hail: INFO: Number of BGEN files parsed: 1
> 2018-11-30 20:22:22 Hail: INFO: Number of samples in BGEN files: 487409
> 2018-11-30 20:22:22 Hail: INFO: Number of variants across all BGEN files: 1255683

These commands run without error, but as soon as I try to do anything with the data, I get a fatal error.

> FatalError: FileNotFoundException: data/bgen/chr22.bgen.idx

The error message makes it seem like Hail still looks for the index file in the same directory as the bgen file.

Any help with this would be great.


You’re correct that the index_file_map parameter should support having index files in a different location than the BGEN data.

Could you please post the full stack trace?

Also, how did you create your index files? The default extension for the most recent version of Hail is .idx2
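
If it helps, here is a rough way to check which index files you have and which extension they use. This is only a sketch: it assumes the files are visible on a local filesystem (not HDFS or cloud storage), and `find_index_files` is a made-up helper name, not part of Hail.

```python
from pathlib import Path


def find_index_files(index_dir, extensions=(".idx", ".idx2")):
    """Group entries in index_dir by index extension.

    Hypothetical helper for illustration; not a Hail API.
    Assumes index_dir is a directory on the local filesystem.
    """
    found = {ext: [] for ext in extensions}
    for path in sorted(Path(index_dir).iterdir()):
        # Path.suffix is the last extension, e.g. ".idx" for "chr22.bgen.idx".
        if path.suffix in found:
            found[path.suffix].append(str(path))
    return found
```

Calling `find_index_files("data/index")` on your layout would show whether the entries there end in `.idx` or `.idx2`.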

Unfortunately, I am not using the latest version of Hail. The version I have available is build devel-f2b0dca9f506. I’m trying to get my informatics group to update the package.

Stack trace:

FatalError                                Traceback (most recent call last)
<command-37581> in <module>()
----> 1

/databricks/spark/python/hail/typecheck/ in wrapper(*args, **kwargs)
    545         def wrapper(*args, **kwargs):
    546             args_, kwargs_ = check_all(f, args, kwargs, checkers, is_method=is_method)
--> 547             return f(*args_, **kwargs_)
    549         update_wrapper(wrapper, f)

/databricks/spark/python/hail/expr/expressions/ in show(self, n, width, truncate, types)
    684             Print an extra header line with the type of each field.
    685         """
--> 686         print(self._show(n, width, truncate, types))
    688     def _show(self, n=10, width=90, truncate=None, types=True):

/databricks/spark/python/hail/expr/expressions/ in _show(self, n, width, truncate, types)
    707         if source is not None:
    708             name = source._fields_inverse.get(self, name)
--> 709         t = self._to_table(name)
    710         if t.key is not None and name in t.key:

Unfortunately, support for the index_file_map parameter was only added in commit cf235511d2 on September 19th. Older versions assume the index file sits in the same place as the BGEN file. Once you are able to upgrade, check out this blog post describing the changes. You will have to reindex your files, as we changed the on-disk format.
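
Once you are on a version that includes this change, the workflow would look roughly like the sketch below: build a map from each BGEN file to an index path in a separate directory, write the new-format indexes there with `hl.index_bgen`, then pass the same map to `hl.import_bgen`. The helper names `build_index_file_map` and `import_with_separate_index` are made up for illustration, the `.idx2` suffix follows the default extension mentioned above, and the Hail calls have not been run against your data, so treat this as a starting point and check the signatures for the version you install.

```python
import posixpath


def build_index_file_map(bgen_paths, index_dir, extension=".idx2"):
    """Map each BGEN path to an index path under index_dir.

    Hypothetical helper for illustration; not a Hail API.
    """
    return {
        bgen: posixpath.join(index_dir, posixpath.basename(bgen) + extension)
        for bgen in bgen_paths
    }


def import_with_separate_index(bgen_paths, index_dir):
    """Reindex and import BGEN files whose indexes live in another directory.

    Assumes a Hail version that supports index_file_map.
    """
    import hail as hl  # imported lazily so the helper above works without Hail

    index_map = build_index_file_map(bgen_paths, index_dir)
    # Write the indexes to the mapped locations instead of next to the BGEN files.
    hl.index_bgen(bgen_paths, index_file_map=index_map)
    # Import, telling Hail where each index lives.
    return hl.import_bgen(bgen_paths,
                          entry_fields=["GP"],
                          index_file_map=index_map)


index_map = build_index_file_map(["data/bgen/chr22.bgen"], "data/index")
print(index_map)
# {'data/bgen/chr22.bgen': 'data/index/chr22.bgen.idx2'}
```

Using the same map for both indexing and importing keeps the two steps from drifting apart if the directory layout changes.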

Thanks for the information. Hopefully, we can get an updated version on our system.