Hi, Hail!
I’m trying to understand why the sample_qc statistics look inconsistent with the number of rows I get after exporting a table.
How I discovered the problem:
- First I run sample QC.
mt = hl.sample_qc(mt)
mt.select_cols(mt.sample_qc.n_snp,
               mt.sample_qc.n_deletion,
               mt.sample_qc.n_insertion,
               mt.sample_qc.r_ti_tv)\
  .cols().show()
2022-03-14 21:44:17 Hail: INFO: Coerced sorted dataset
+-----------+-------+------------+-------------+----------+
| s | n_snp | n_deletion | n_insertion | r_ti_tv |
+-----------+-------+------------+-------------+----------+
| str | int64 | int64 | int64 | float64 |
+-----------+-------+------------+-------------+----------+
| "P512_10" | 22333 | 65 | 28 | 3,21e+00 |
| "P512_7" | 22712 | 70 | 23 | 3,12e+00 |
| "P512_8" | 22389 | 60 | 25 | 3,26e+00 |
+-----------+-------+------------+-------------+----------+
- Then I export the rows table to a file.
mt.rows().select('consequence')\
.export('result.tsv')
- Check the consequence counts on the command line.
cut -f3 result.tsv | sort | uniq -c | sort -k1,1n
1 consequence
640 splice_donor_variant
1845 splice_acceptor_variant
5735 frameshift_variant
9659 stop_gained
41369 synonymous_variant
87966 missense_variant
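(As a cross-check, the same counting can be done inside Hail before exporting; this is just a minimal sketch that uses only the consequence row field shown in describe() below.)
# Count rows per consequence directly on the rows table,
# independent of the exported TSV
consequence_counts = mt.aggregate_rows(hl.agg.counter(mt.consequence))
print(consequence_counts)  # should agree with the uniq -c output above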
Conclusion:
“~22,000 n_snp per sample, with only 3 samples in the dataset” vs “~88,000 missense_variant rows” ???
How do mt.sample_qc.n_snp, mt.sample_qc.n_deletion and mt.sample_qc.n_insertion actually work?
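My rough understanding (please correct me if it's wrong) is that sample_qc counts alternate alleles carried by each sample's called genotypes, so rows where that sample is hom-ref or has a missing GT contribute nothing to its n_snp. Here is a hedged sketch of that interpretation; n_snp_manual is just a name I made up, everything else comes from describe() below:
# Count SNP alternate alleles per sample from its called genotypes
# (assumes biallelic rows, which 'was_split' in describe() suggests)
mt_check = mt.annotate_cols(
    n_snp_manual=hl.agg.filter(hl.is_snp(mt.alleles[0], mt.alleles[1]),
                               hl.agg.sum(mt.GT.n_alt_alleles())))
mt_check.select_cols(mt_check.sample_qc.n_snp, mt_check.n_snp_manual).cols().show()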
P.S.
A little more information about the data:
mt.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
's': str
'sample_qc': struct {
dp_stats: struct {
mean: float64,
stdev: float64,
min: float64,
max: float64
},
gq_stats: struct {
mean: float64,
stdev: float64,
min: float64,
max: float64
},
call_rate: float64,
n_called: int64,
n_not_called: int64,
n_filtered: int64,
n_hom_ref: int64,
n_het: int64,
n_hom_var: int64,
n_non_ref: int64,
n_singleton: int64,
n_snp: int64,
n_insertion: int64,
n_deletion: int64,
n_transition: int64,
n_transversion: int64,
n_star: int64,
r_ti_tv: float64,
r_het_hom_var: float64,
r_insertion_deletion: float64
}
----------------------------------------
Row fields:
'locus': locus<GRCh37>
'alleles': array<str>
'rsid': str
'qual': float64
'filters': set<str>
'info': struct {
AC: array<int32>,
AF: array<float64>,
AN: int32,
BaseQRankSum: float64,
DP: int32,
END: int32,
ExcessHet: float64,
FS: float64,
InbreedingCoeff: float64,
MLEAC: array<int32>,
MLEAF: array<float64>,
MQ: float64,
MQRankSum: float64,
QD: float64,
RAW_MQandDP: array<int32>,
ReadPosRankSum: float64,
SOR: float64,
CSQ: array<str>
}
'a_index': int32
'was_split': bool
'variant_qc': struct {
dp_stats: struct {
mean: float64,
stdev: float64,
min: float64,
max: float64
},
gq_stats: struct {
mean: float64,
stdev: float64,
min: float64,
max: float64
},
AC: array<int32>,
AF: array<float64>,
AN: int32,
homozygote_count: array<int32>,
call_rate: float64,
n_called: int64,
n_not_called: int64,
n_filtered: int64,
n_het: int64,
n_non_ref: int64,
het_freq_hwe: float64,
p_value_hwe: float64,
p_value_excess_het: float64
}
'kept_transcripts': array<str>
'consequence': str
'gene': str
'gnomad_AF': str
'gnomad_NFE_AF': str
'MPC': str
'hom_ref': int64
'hom_var': int64
'het': int64
----------------------------------------
Entry fields:
'AD': array<int32>
'DP': int32
'GQ': int32
'GT': call
'MIN_DP': int32
'PGT': call
'PID': str
'PL': array<int32>
'PS': int32
'RGQ': int32
'SB': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------
Data size
mt.count()
(147214, 3)
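My guess is that the 147,214 rows are the union of variant sites over the whole callset and each sample carries a non-reference genotype at only a fraction of them, but I'm not sure. This is how I would check that per sample (again only a sketch; the field names n_non_ref_calls etc. are made up):
# Per-sample breakdown of genotype calls across all 147,214 rows
mt_counts = mt.annotate_cols(
    n_non_ref_calls=hl.agg.count_where(mt.GT.is_non_ref()),
    n_hom_ref_calls=hl.agg.count_where(mt.GT.is_hom_ref()),
    n_missing_calls=hl.agg.count_where(hl.is_missing(mt.GT)))
mt_counts.cols().select('n_non_ref_calls', 'n_hom_ref_calls', 'n_missing_calls').show()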
Variant summary stats
hl.summarize_variants(mt)
Number of variants: 147214
==============================
Alleles per variant
-------------------
2 alleles: 147214 variants
==============================
Variants per contig
-------------------
1: 14992 variants
2: 11162 variants
3: 8812 variants
4: 6494 variants
5: 7121 variants
6: 6728 variants
7: 6811 variants
8: 5045 variants
9: 6047 variants
10: 6159 variants
11: 8811 variants
12: 8167 variants
13: 2834 variants
14: 4787 variants
15: 4931 variants
16: 5492 variants
17: 7522 variants
18: 2683 variants
19: 9411 variants
20: 3731 variants
21: 1454 variants
22: 3067 variants
X: 4952 variants
Y: 1 variants
==============================
Allele type distribution
------------------------
SNP: 141392 alternate alleles (Ti: 34646, Tv: 106746, ratio: 0.32)
Deletion: 5629 alternate alleles
Insertion: 183 alternate alleles
Complex: 9 alternate alleles
Unknown: 1 alternate alleles
==============================
P.P.S.
Even if I assume I annotated the “consequence” field incorrectly earlier, that would not explain the more than twofold difference between the per-sample SNP counts and the number of rows in the dataset.
P.P.P.S.
Even if the 3 samples had completely non-overlapping variants, that still wouldn't explain it: 3 × ~22,000 is still well below 147,214 rows.
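If it helps, this is how I was planning to check how many of the 147,214 rows actually carry a non-reference call in at least one of the 3 samples (just a sketch, using nothing beyond the GT entry field):
# Rows with at least one non-ref genotype call across the 3 samples
n_rows_with_alt_call = mt.filter_rows(hl.agg.any(mt.GT.is_non_ref())).count_rows()
print(n_rows_with_alt_call, 'of', mt.count_rows(), 'rows have a non-ref call in any sample')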
Does anyone have any ideas?