Hi folks—
I am new to Hail and the UK Biobank, and I am trying to accomplish what I believe to be a pretty simple task: load a pVCF for one of the chromosome blocks and explore the data. Eventually I want to extract variants, genotypes, and patient identifiers (EIDs) within a certain region, although I haven’t gotten there yet. However, when I try to explore the data, the commands run slowly, and I suspect I am not doing this efficiently.
The pVCF I’m interested in is ~23 GB in size. I’m using the following code:
import hail as hl
hl.init()
from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()
mt = hl.import_vcf('file:////mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c3_b3_v1.vcf.gz', force_bgz=True, reference_genome="GRCh38")
Some commands run quickly, such as
mt.describe()
However, when I try to look at the loci of the first few rows with the following code, it takes a long time (I’d be very much interested in better alternatives):
row_table = mt.rows().select()
row_table.head(4).locus.show()
I timed the second command, and it took about 2.5 min.
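A measurement like this can be reproduced with a small standard-library timer. The `timed` helper below is a hypothetical name introduced here for illustration and is not part of Hail; in a notebook, the `%%time` cell magic would serve the same purpose.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the raw number for later comparison
        print(f"{label}: {elapsed:.1f} s")


# Usage in the notebook (row_table is the table defined above):
# with timed("head(4).locus.show()"):
#     row_table.head(4).locus.show()
```

Passing a dict as `results` makes it easy to compare timings before and after any change to the pipeline.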
I am already using what I thought would be a pretty powerful Jupyter notebook instance on DNAnexus, mem2_ssd2_x16. I don’t know whether it’s relevant, but here is the output of lscpu and lsblk:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2300.176
BogoMIPS: 4600.15
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-15
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 47M 1 loop
loop1 7:1 0 47M 1 loop
loop2 7:2 0 25.1M 1 loop
loop3 7:3 0 55.5M 1 loop
loop4 7:4 0 55.6M 1 loop
xvda 202:0 0 13G 0 disk
├─xvda1 202:1 0 12.9G 0 part
├─xvda14 202:14 0 4M 0 part
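For context, this is roughly the end goal described at the top, written as an untested sketch. It assumes that converting the pVCF to Hail's native MatrixTable format once and then filtering with `hl.filter_intervals` is a sensible route, which is exactly the kind of thing I would like confirmed. The function names `region_string` and `extract_region`, the output path, and the coordinates are placeholders made up for illustration.

```python
def region_string(contig, start, end):
    """Format a region as 'contig:start-end' for hl.parse_locus_interval."""
    if start > end:
        raise ValueError("start must not exceed end")
    return f"{contig}:{start}-{end}"


def extract_region(vcf_path, mt_path, contig, start, end):
    """Convert the VCF to a native MatrixTable once, then filter to one region."""
    import hail as hl  # imported lazily so region_string works without Hail

    # One-time conversion; later reads of mt_path avoid re-parsing the VCF text.
    mt = hl.import_vcf(vcf_path, force_bgz=True, reference_genome="GRCh38")
    mt.write(mt_path, overwrite=True)

    mt = hl.read_matrix_table(mt_path)
    interval = hl.parse_locus_interval(
        region_string(contig, start, end), reference_genome="GRCh38"
    )
    region_mt = hl.filter_intervals(mt, [interval])

    # Row keys (locus, alleles) and the column key `s` (sample IDs) only.
    region_mt.rows().select().show(4)
    region_mt.cols().select().show(4)
    return region_mt


# Placeholder call; the output path and coordinates are made up:
# extract_region(vcf_path, "file:///tmp/ukb_c3_b3.mt", "chr3", 1000000, 1100000)
```

I have not checked whether the instance has enough local disk for the converted table, so the output location would probably need to change.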
Any help would be appreciated!
Best,
Jeremy