Get invalid_loci, which are dropped


I am importing vcf files which contains standard chromosome names but also some ambiguous.
Is there a way to preserve all lines which have invalid_loci and are not supported/default for the genome

mt = hl.methods.import_vcf(input_vcf, contig_recoding=recode, force_bgz=True, skip_invalid_loci=True)

For example if the vcf file have chromosome name such as chr1_r3r3r, how can I still save somehow these lines? For example to save the lines with invalid_loci in second matrixtable?

We can’t put them in a second matrix table. The fact that they were invalid means we can’t make objects for them. If you do skip_invalid_loci=False, you should get an error message about which one was invalid.

However, you could do the following to pull out the invalid lines:

mt = hl.import_vcf(input_vcf, contig_recoding=recode, force_bgz=True, 
    reference_genome=None) # will import locus as struct, not dtype locus

mt = mt.filter_rows(~hl.is_valid_locus(,, reference_genome='your rg here'))


Ah, didn’t know about that! That’s probably what you want, neat.

@tpoterba thank you, this was my idea.

Is it possible after that to store the mt with the wrong chromosome names back to a vcf file like:

hl.export_vcf(mt ,‘wrong_locus.vcf’)?

Fot this line I am still getting the error:

HailException: Invalid locus ‘chr1_gl000192_random:99283’ found. Contig ‘chr1_gl000192_random’ is not in the reference genome ‘GRCh37’

Did you use reference_genome=None?

yes, chr1_gl000192_random is not standard and I cannot define reference_genome, but I want to preserve exactly chr1_gl000192_random in the vcf output file so I am able to see this mistakes.

Before you export_vcf do mt.describe() and show us the print out

Global fields:
Column fields:
‘s’: str
Row fields:
‘locus’: locus
‘alleles’: array
‘rsid’: str
‘qual’: float64
‘filters’: set
‘info’: struct {
AC: array,
AF: array,
AN: int32,
BaseQRankSum: float64,
CCC: int32,
DP: int32,
DS: bool,
END: int32,
FS: float64,
GQ_MEAN: float64,
GQ_STDDEV: float64,
HWP: float64,
HaplotypeScore: float64,
InbreedingCoeff: float64,
MLEAC: array,
MLEAF: array,
MQ: float64,
MQ0: int32,
MQRankSum: float64,
NCC: int32,
QD: float64,
ReadPosRankSum: float64,
SOR: float64,
VQSLOD: float64,
culprit: str
Entry fields:
‘AD’: array
‘DP’: int32
‘GQ’: int32
‘GT’: call
‘PGT’: call
‘PID’: str
‘PL’: array
‘SB’: array
Column key: [‘s’]
Row key: [‘locus’, ‘alleles’]

Yeah, that’s what I thought. You didn’t specify reference_genome=None when you imported the VCF. I can tell because the type of locus is locus, instead of struct.

this was the problem, I though that None and GRCh37 are synonymous, but when I set to reference_genome=None, I got again:

ValueError: Method ‘export_vcf’ requires row key to be two fields ‘locus’ (type ‘locus’) and ‘alleles’ (type ‘array’)
‘locus’: struct{contig: str, position: int32}
‘alleles’: array

Should I define some kind of a reference_genome to use export_vcf()? or should I define a header?

I found


and it works, but I would be happy if there is a solution for hl.export_vcf(mt,output)

You probably just want mt.rows().export('rows.tsv'). That will include all the relevant locus and allele information to find the invalid rows.

There is likely no solution for export_vcf, as hail does not allow you to export invalid VCFs.

Ok thank you!

I think we can probably extend export_vcf to work with the struct locus.

1 Like

It would be useful, yes