I’ve been using import_matrix_table to start working with some large .tsv files.
However, I am finding the datatypes allowed for the entry field to be a bit limiting for my application. Is there any way to extend the allowed entry-field types to include an array or even a general struct?
Thanks for your help!
What do the entries look like? Are they JSON? If so, you can import the entry field as a string and use hl.parse_json to convert to any other type.
For the time being, an array would be the most helpful entry type, but a more general object-like set of fields, as is the case when importing VCFs, would be the most useful.
Storing the data as a string and converting it with hl.parse_json on the fly seems to come with a lot of overhead, so it may not be a practical approach.
Thanks for the help so far!
I think the two are going to be roughly equivalent. Parsing text is pretty slow.
I agree. Maybe I misread your previous suggestion, but I understood you to be suggesting that I store the entry object as a JSON string and parse it upon each retrieval. Either way, this doesn’t seem practical to me.
As a first pass, is there any way to store an array of int32 in the entry-field?
Thanks!
Perhaps you can say a little more about the application? I’m generally assuming that this text matrix is an interchange format, and performance on import isn’t quite as much of a concern as downstream processing.
Do you mean in the entry field of the text matrix? Sure, you could use comma-delimited integers for instance, and do something like the following after importing as a string:
import hail as hl
mt = hl.import_matrix_table(...)
# comma_delimited_ints is a placeholder for the name of the string entry field
mt = mt.select_entries(int_array = mt.comma_delimited_ints.split(',').map(lambda x: hl.int32(x)))
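To make the per-entry transformation concrete, here is a minimal plain-Python sketch of what the split-and-convert step above computes for a single entry. It does not use Hail at all. The function name parse_int_array and the sample strings are made up for illustration, and returning an empty list for an empty string is an assumption about how missing entries might be treated, not something Hail does for you.

```python
from typing import List


def parse_int_array(entry: str) -> List[int]:
    """Turn a comma-delimited string such as "1,2,3" into a list of ints."""
    # Assumption: an empty entry means "no values" rather than an error.
    if entry == "":
        return []
    # Split on commas, then convert each token to an integer.
    return [int(token) for token in entry.split(",")]


# Hypothetical entries from one row of the text matrix.
row = ["1,2,3", "4,5", "6"]
parsed = [parse_int_array(e) for e in row]
print(parsed)
```

In Hail the same logic is expressed lazily on the entry field, so the parse happens once when the matrix table is written out after import, not on every later read.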