What would be the cleanest way to split a dictionary generated by agg.counter (w/o using a loop, maybe?). I want to use the output to annotate a MatrixTable.
I want to execute the de_novo function but I need to generate my .fam files first. Is there a way to generate a single table (maybe a merge) from the outputs of my individual family de_novo output tables? My data was generated by GATK so I shouldn’t have any issue executing the function.
I don’t totally understand what you’re asking, sorry. Is the problem the generation of the pedigree files (you could also construct those in Python if that’s easier) or the usage of the results of de_novo?
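For anyone following along, here is a minimal sketch of building pedigree lines in Python, using only the standard library. The trio records and sample IDs are hypothetical placeholders; the six tab-separated columns follow the PLINK .fam layout (family, individual, father, mother, sex, phenotype), which is what hl.Pedigree.read expects as far as I know.

```python
import csv
import io

# Hypothetical trio records: (family, child, father, mother, child_sex).
# Sex uses the PLINK coding: 1 = male, 2 = female, 0 = unknown.
trios = [
    ("FAM001", "FAM001_child", "FAM001_father", "FAM001_mother", 1),
    ("FAM002", "FAM002_child", "FAM002_father", "FAM002_mother", 2),
]


def fam_rows(trios):
    """Yield six-column .fam rows for each trio (parents first, then child)."""
    for family, child, father, mother, child_sex in trios:
        # Founders get parental IDs of 0; a phenotype of -9 means "missing".
        yield [family, father, "0", "0", "1", "-9"]
        yield [family, mother, "0", "0", "2", "-9"]
        yield [family, child, father, mother, str(child_sex), "-9"]


def write_fam(trios, handle):
    """Write the rows to any text handle as tab-separated lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(fam_rows(trios))


# Write to an in-memory buffer here; pass an open file to write to disk.
buffer = io.StringIO()
write_fam(trios, buffer)
fam_text = buffer.getvalue()
print(fam_text)
```

Writing every family into one handle gives a single pedigree file, so nothing has to be repeated per family later.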
Due to the nature of my annotation file, I can only annotate over rows using the locus as key. If I set my row keys to be Gene and Locus and my column keys to be family and role, how can I do 1) an aggregation to count the number of variant calls that a single family has per gene and 2) the number of variants per gene, per family member?
Also, is there a way to apply a filter (filter_cols) using a baseline pattern? For instance, I have the following in my set: A2M, A2M-AS1, A2ML1, and I’ve tried many regular expressions but only the exact match works.
is there a way to apply a filter (filter_cols) using a baseline pattern
Yes, Java regex syntax should work. If there’s a specific case we can look at where things seem to be misbehaving, that would be helpful.
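As a neutral illustration of exact versus prefix patterns, here is a small sketch using Python's re module as a stand-in; the ^ and $ anchors used here behave the same way in Java regex for plain gene names. The gene AAAS is added only for contrast and is not from the original question. This does not diagnose the reported behaviour; it only shows which pattern shape selects which names.

```python
import re

# The three names from the question, plus one unrelated name for contrast.
genes = ["A2M", "A2M-AS1", "A2ML1", "AAAS"]

# Anchored at both ends: only the exact name matches.
exact = re.compile(r"^A2M$")
# Anchored at the start only: every name beginning with A2M matches.
prefix = re.compile(r"^A2M")

exact_hits = [g for g in genes if exact.search(g)]
prefix_hits = [g for g in genes if prefix.search(g)]

print(exact_hits)
print(prefix_hits)
```

In Hail the same pattern string would be passed to the string expression's matches method inside the filter.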
How can I do 1) an aggregation to count the number of variant calls that a single family has per gene and 2) the number of variants per gene, per family member?
Have you managed to get a pedigree constructed? Using the trio_matrix function will make this super easy. In that case, you’ll want to do something like:
Since I have 2,000+ families, I was hoping there was something like: mt.group_rows_by('Gene').aggregate(fam_count=agg.count()) or mt.group_cols_by('family').aggregate(gene_count=agg.count()) (this is the closest I get to my desired result) in order to get a summarized matrix. Let me generate the pedigree files and explore the trio_matrix option.
Thanks again!
you can do this programmatically in Python, too. You can definitely do what you’ve just written, as well. I think there might be one step which is missing (this PR adds it): the ability to aggregate row/col fields as well as entry fields.
mt2 = mt.group_rows_by('Gene').aggregate(n_alt_alleles = hl.agg.sum(mt.GT.n_alt_alleles()))
# now mt2 is a gene-by-sample matrix with n_alt_alleles as the only entry field
mt3 = mt2.group_cols_by(mt2.family).aggregate(alt_alleles = hl.agg.sum(mt2.n_alt_alleles))
# now mt3 is a gene-by-family matrix of alt alleles
does mt2 satisfy the “the number of variants per gene, per family member?” that you want?
you can’t currently aggregate in both dimensions at once – group_cols_by().aggregate() aggregates over columns, preserving the row fields / number of rows. group_rows_by().aggregate() aggregates over rows, preserving the columns exactly.
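To make the shapes concrete, here is a plain-Python analogue of the two-step aggregation above, with dictionaries standing in for the MatrixTable. The variants, samples, genes and families are hypothetical toy data, and this is only a sketch of the idea, not how Hail implements it.

```python
from collections import defaultdict

# Hypothetical toy data: entries[(variant, sample)] = number of alt alleles.
variant_gene = {"v1": "GENE_A", "v2": "GENE_A", "v3": "GENE_B"}
sample_family = {"s1": "F1", "s2": "F1", "s3": "F2"}
entries = {
    ("v1", "s1"): 1, ("v1", "s2"): 0, ("v1", "s3"): 2,
    ("v2", "s1"): 1, ("v2", "s2"): 1, ("v2", "s3"): 0,
    ("v3", "s1"): 0, ("v3", "s2"): 2, ("v3", "s3"): 1,
}

# Step 1: collapse rows (variants -> genes); the samples are left untouched.
gene_by_sample = defaultdict(int)
for (variant, sample), n_alt in entries.items():
    gene_by_sample[(variant_gene[variant], sample)] += n_alt

# Step 2: collapse columns (samples -> families); the genes are left untouched.
gene_by_family = defaultdict(int)
for (gene, sample), n_alt in gene_by_sample.items():
    gene_by_family[(gene, sample_family[sample])] += n_alt

print(dict(gene_by_sample))
print(dict(gene_by_family))
```

The intermediate gene_by_sample result corresponds to mt2 (per gene, per family member) and gene_by_family corresponds to mt3 (per gene, per family).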
Hi Tim,
I have 4 questions for you, all related to the same task:
1)
I’m using the trio_matrix and the de_novo functions to get the de novo variants. I’ve been using trio pedigree files, which means I have to repeat the process once per patient. I have over 4K individuals and the procedure is the same for all of them. My initial matrix table has all the patients. I just apply a filter_cols function on a per-family basis before obtaining the corresponding variants. Instead of executing a for loop with the data of each individual, is there a different approach that would make this task more efficient?
2)
Once I get my SNVs and after applying filters for de_novo selection, I do an mt_tm.count(), since I want to know the total SNV # and the de_novo variants per individual. However, I get the same count before and after applying the filters (even in different variables). What is the best way to get the right count?
3)
All my resulting tables (or matrix tables) from the steps above have the exact same keys. Is there a way to do vertical stacking in Hail just like in the function pd.concat? I want to have a table with all the results for further analysis.
4)
Finally, I have an entry called score of type int32. Some of its elements are NAs. I need to do operations with that entry and I tried doing something like: mt=mt.annotate_entries(test=mt.score+1) expecting to have a 1 instead of NAs but the NAs do not change. I tried a few things with is_nan but I couldn’t find the right expression. What would do the trick in this case?
If you concatenate all the trio files (in unix, maybe) and read in as one pedigree, this will be MUCH MUCH MUCH faster and let you do trio_matrix / de_novo once.
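If you would rather do the concatenation in Python than in unix, a minimal sketch could look like the following. The file names are hypothetical, and skipping repeated lines is only a precaution for cases where the same parents appear in more than one trio file.

```python
import tempfile
from pathlib import Path


def concatenate_fam_files(fam_paths, out_path):
    """Write all non-empty lines from fam_paths to out_path, skipping repeats."""
    seen = set()
    with open(out_path, "w") as out:
        for path in fam_paths:
            for line in Path(path).read_text().splitlines():
                # Split on any whitespace so tab- and space-delimited
                # files compare equal.
                fields = tuple(line.split())
                if fields and fields not in seen:
                    seen.add(fields)
                    out.write("\t".join(fields) + "\n")
    return len(seen)


# Demonstration with two hypothetical per-trio files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)
    (tmp / "fam1.fam").write_text(
        "F1 dad1 0 0 1 -9\nF1 mom1 0 0 2 -9\nF1 kid1 dad1 mom1 1 -9\n"
    )
    (tmp / "fam2.fam").write_text(
        "F2 dad2 0 0 1 -9\nF2 mom2 0 0 2 -9\nF2 kid2 dad2 mom2 2 -9\n"
    )
    # The input list is built before the output file exists, so the
    # combined file is never read back in as one of its own inputs.
    inputs = sorted(tmp.glob("*.fam"))
    n_lines = concatenate_fam_files(inputs, tmp / "all_trios.fam")
    combined = (tmp / "all_trios.fam").read_text()

print(n_lines)
print(combined)
```

The combined file can then be read once as a single pedigree and passed to trio_matrix / de_novo.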
are you filtering rows, cols, or entries? If you are filtering rows or cols, then the count should change. If you’re filtering entries, then the count of rows/cols will stay the same.
If you do it all in parallel as in (1) then this is no longer necessary, but ht.union is like pd.concat.
the functions you want are hl.is_defined and hl.is_missing. In Hail (and most languages), a missing value is distinct from nan. You might also be interested in the function hl.coalesce, which takes any number of arguments and returns the first non-missing one; e.g. hl.coalesce(mt.score, 0) + 1 will return 1 for missing scores.
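The distinction between missing and nan can be sketched in plain Python, with None playing the role of a missing value. The coalesce helper below is a hypothetical stand-in that mimics the behaviour described above; it is not Hail's implementation.

```python
import math


def coalesce(*values):
    """Return the first argument that is not None, or None if all are missing."""
    for value in values:
        if value is not None:
            return value
    return None


# None plays the role of a missing int32 score.
scores = [3, None, 7]
plus_one = [coalesce(score, 0) + 1 for score in scores]

# NaN is a defined floating-point value, so a NaN test never finds missing data.
nan = float("nan")
nan_is_missing = nan is None
nan_is_nan = math.isnan(nan)

print(plus_one)
print(nan_is_missing, nan_is_nan)
```

This is why is_nan-style checks do not find the NAs in an int32 field: the values are missing, not nan.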