Hi, I was importing .bed, .bim and .fam files in PLINK format using Hail and I noticed a problem with the creation of the MatrixTable. The import with import_plink reported:
2020-08-21 08:47:17 Hail: INFO: Found 1141 samples in fam file.
2020-08-21 08:47:17 Hail: INFO: Found 730059 variants in bim file.
But when I run mt.count_cols() and mt.count_rows() I get 1141 and 0 as the results.
Does anyone know how I can solve this? I’d like to get 1141 samples as the result of the rows’ count.
Many thanks
It sounds like all the loci in your dataset are invalid and, as a result, were skipped. Are you sure your dataset is aligned to GRCh38? You might try GRCh37.
You might also look at the documentation for import_plink's contig_recoding argument. By default, Hail assumes that in PLINK data aligned to GRCh38, chromosome 1, for example, is encoded as “1”. You can inspect how your chromosomes are actually represented by importing without a reference genome (reference_genome=None).
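A minimal sketch of that inspection, assuming the Hail Python API; the helper name contig_counts and the file paths in the usage comment are hypothetical. With reference_genome=None the locus is a plain struct of contig and position, so no locus is validated against a reference and the contig names appear exactly as written in the .bim file.

```python
def contig_counts(bed, bim, fam):
    """Count variants per contig name as it appears in the .bim file."""
    # Imported inside the function so that defining this helper does not
    # itself require Hail; calling it does.
    import hail as hl

    # No reference genome: locus becomes a struct with fields
    # `contig` (str) and `position` (int32), and nothing is skipped.
    mt = hl.import_plink(bed=bed, bim=bim, fam=fam, reference_genome=None)

    # Aggregate over rows (variants) to get a dict of contig name -> count.
    return mt.aggregate_rows(hl.agg.counter(mt.locus.contig))


# Usage (paths are placeholders):
# print(contig_counts("data.bed", "data.bim", "data.fam"))
```

If the keys come back as "1", "2", ... rather than "chr1", "chr2", ..., or include a "0" contig, that tells you which names need recoding or filtering before importing against GRCh38.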
Thanks! Yes, my data are aligned to GRCh38. The problem was linked to the skip_invalid_loci option.
I don’t understand the reason, but because my data contained some SNPs with chromosome or position = 0, all the SNPs with correct locus information were treated as invalid too. I solved it by deleting the bad SNPs from the data before importing.
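For anyone hitting the same thing, here is a rough sketch of how the bad SNPs could be located before importing. It is not the exact procedure used above: it assumes the standard whitespace-delimited .bim layout (chromosome, variant ID, cM position, base-pair position, allele 1, allele 2), and the function name is made up for illustration.

```python
from typing import Iterable, List


def unplaced_variant_ids(bim_lines: Iterable[str]) -> List[str]:
    """Return IDs of variants whose chromosome code or base-pair position is 0."""
    flagged = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, variant_id, _cm, position = fields[:4]
        if chrom == "0" or int(position) == 0:
            flagged.append(variant_id)
    return flagged


# Usage (file names are placeholders):
# with open("data.bim") as bim, open("bad_snps.txt", "w") as out:
#     out.write("\n".join(unplaced_variant_ids(bim)) + "\n")
```

Only the IDs are collected here, because deleting lines from the .bim alone would leave it out of sync with the .bed. The resulting list can be handed to PLINK's --exclude together with --make-bed to write a consistent filtered fileset.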