Hello, I'm new to Hail, so please forgive my ignorance.
I'm working with the UK Biobank data, and Hail seems really difficult to use, yet it appears to be the only feasible way to accomplish the following:
I need to isolate a locus range within a very large VCF, annotate it with VEP/ClinVar data, and then subset the rows further based on features such as amino acid position. My code is as follows:
```python
import glob

import hail as hl

loc = 'chr21:25880535-26171128'
aapositions = [2011, 2010, 2018, 2146, 2150, 2149, 2021, 2141, 2080, 2140,
               2147, 2144, 2020, 2032, 2113, 2172, 2146, 2044, 2137, 2149,
               2077, 2149]
chrom, start, end = loc.replace(':', ' ').replace('-', ' ').split(' ')
# Build 'pos;pos' strings to match the format of the aapos annotation field
aapositions_list = [str(mut) + ';' + str(mut) for mut in aapositions]

file = ('/mnt/project/Bulk/Exome sequences/'
        'Population level exome OQFE variants, pVCF format - final release/'
        'ukb23157_c' + chrom.replace('chr', '') + '*.vcf.gz')
chrom_files = glob.glob(file)

for chrom_file in chrom_files:
    chrom_file = 'file://' + chrom_file
    mt = hl.import_vcf(chrom_file,
                       force_bgz=True,
                       reference_genome='GRCh38',
                       array_elements_required=False)

    # Filter to the specified region
    region_start = hl.Locus.parse(chrom + ':' + start, reference_genome='GRCh38')
    region_end = hl.Locus.parse(chrom + ':' + end, reference_genome='GRCh38')
    mt = mt.filter_rows((mt.locus >= region_start) & (mt.locus <= region_end))

    if mt.count_rows() > 0:
        print('rows present in ' + chrom)
        db = hl.experimental.DB(region='us', cloud='aws')
        annotated_mt = db.annotate_rows_db(mt, 'dbNSFP_variants')

        aapos_set = hl.set(aapositions_list)
        filtered_mt = annotated_mt.filter_rows(
            hl.is_defined(annotated_mt.dbNSFP_variants) &
            hl.any(lambda x: aapos_set.contains(x.aapos),
                   annotated_mt.dbNSFP_variants)
        )

        if filtered_mt.count_rows() > 0:
            rows_table = filtered_mt.rows()
            exploded_table = rows_table.explode(rows_table.dbNSFP_variants)
            flattened_table = exploded_table.transmute(**exploded_table.dbNSFP_variants)
        else:
            flattened_table = mt.rows()
        break
```
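For reference, here is the pure-Python part of the setup in isolation (no Hail required), using the same region string and amino acid positions as above, so it's clear what the parsed region and match keys look like. The regex parse is just an alternative sketch of the chained `replace()` calls:

```python
import re

# Same region string as in the pipeline above
loc = 'chr21:25880535-26171128'

# Parse 'chrom:start-end' with a regex instead of chained replace() calls
match = re.fullmatch(r'(\w+):(\d+)-(\d+)', loc)
chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))

# Amino acid positions of interest (the duplicates in this list are
# harmless once the values are collected into a set)
aapositions = [2011, 2010, 2018, 2146, 2150, 2149, 2021, 2141, 2080, 2140,
               2147, 2144, 2020, 2032, 2113, 2172, 2146, 2044, 2137, 2149,
               2077, 2149]

# Build 'pos;pos' keys to match against the aapos annotation field
aapos_keys = {str(p) + ';' + str(p) for p in aapositions}

print(chrom, start, end)  # chr21 25880535 26171128
print(len(aapos_keys))    # 19 unique keys
```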
However, this code consistently errors out with what I believe is a memory error:

```
Error summary: ClassNotFoundException: is.hail.backend.spark.SparkBackend$$anon$5$RDDPartition
```
When it doesn't fail this way, the results prove extremely difficult to explore: exporting them to a file can take a very long time and also errors out frequently.
I'm hoping there is something simple in my code that can be spotted to resolve the slowness? Increasing my memory request does not seem to improve things.