Spark Tweak - to_pandas function


#21

My samples’ names look similar to this:

family1.father
family1.mother
family1.child
family1.sibling1
family1.sibling2

Therefore, I can find variants across families, compare also across specific members or within family members too.

Your example above is a good guideline. Thanks!


#22

we also have some rudimentary trio-based methods:

trio_matrix
mendel_errors
de_novo

https://hail.is/docs/devel/methods/genetics.html


#23

I think this is what I needed:

split = ht.s.split("_")
ht = ht.annotate(family = split[0],
                 role = split[1])
per_family_ht = ht.group_by('family')\
    .aggregate(mean_pheno_1 = hl.agg.mean(ht.pheno_1))

Thanks again,

Carlos


#24

Hi Tim,

I have a couple of questions:

  1. What would be the cleanest way to split a dictionary generated by agg.counter (w/o using a loop, maybe?). I want to use the output to annotate a MatrixTable.

  2. I want to execute the de_novo function but I need to generate my .fam files first. Is there a way to generate a single table (maybe a merge) from the outputs of my individual family de_novo output tables? My data was generated by GATK so I shouln’t have an issue to execute the function.

Thanks!


#25

Split how? If you want to use that result in expressions, you can do something like:

c = mt.aggregate_rows(hl.agg.counter(...))
c = hl.literal(c)  # converts to expression
mt = mt.annotate_rows(bar = c[mt.foo] + 5)

#26

I don’t totally understand what you’re asking, sorry. Is the problem the generation of the pedigree files (you could also construct those in Python if that’s easier) or the usage of the results of de_novo?


#27

Since the de_novo function returns a table, and supposing I will execute it in 10 trios, how can I merge/bind those 10 tables into a single one?


#28

do you have one MatrixTable for each trio? If so, use ht.union. Otherwise, you’ll call de novo events for each trio.


#29

Hi Tim,

Due to the nature of my annotation file, I can only annotate over rows using the locus as key. If I set my row keys to be Gene and Locus and my column keys family and role. How can I do 1) an aggregation to count the number of variant calls that a single family has per gene and 2) the number of variants per gene, per family member?

Also, is there a way to apply a filter (filter_cols) using a balseline pattern? For instance, I have the following in my set: A2M, A2M-AS1, A2ML1, and I’ve used many RegExps but only the exact match works.

Thanks!


#30

is there a way to apply a filter ( filter_cols ) using a balseline pattern

Yes, java regex syntax should work. If there’s a specific case we can look at where things seem to be misbehaving, that would be helpful.

How can I do 1) an aggregation to count the number of variant calls that a single family has per gene and 2) the number of variants per gene, per family member?

Have you managed to get a pedigree constructed? Using the trio_matrix function will make this super easy. In that case, you’ll want to do something like:

tm = hl.trio_matrix(mt, pedigree)
tm = tm.group_rows_by(tm.gene).aggregate(
    n_calls_proband = hl.agg.sum(mt.proband.GT.n_alt_alleles()),
    n_calls_mother = hl.agg.sum(mt.mother.GT.n_alt_alleles()),
    n_calls_father = hl.agg.sum(mt.father.GT.n_alt_alleles()))

Now you have all that information, I think (per family is just summing those three values)


#31

Since I have 2,000+ families, I was hoping there was something like: mt.group_rows_by('Gene').aggregate(fam_count=agg.count()) or mt.group_cols_by('family').aggregate(gene_count=agg.count()) (this is the closest I get to my desired result) in order to get a summarized matrix. Let me generate the pedigree files and explore the trio_matrix option.
Thanks again!


#32

you can do this programmatically in Python, too. You can definitely do what you’ve just written, as well. I think there might be one step which is missing (this PR adds it): the ability to aggregate row/col fields as well as entry fields.


#33

This works in Python:

set=['AM','AMG','AM1','AM-1']
r = re.compile("AM.*")
print(list(filter(r.match,set)))
Ans: ['AM', 'AMG', 'AM1', 'AM-1']

And this is what I do in hail:

1)
yy = xx.filter_rows(xx['Gene']=='A2M', keep=True)
yy.aggregate_rows(hl.agg.collect_as_set(yy['Gene']))
ans: {'A2M'}
2)
yy = xx.filter_rows(xx['Gene']=='A2M.*', keep=True)
yy.aggregate_rows(hl.agg.collect_as_set(yy['Gene']))
Ans: set()

From:

you can do this programmatically in Python, too. You can definitely do what you’ve just written, as well. I think there might be one step which is missing (this PR adds it):

I try to do :

dataset_result = (mt.group_cols_by('family')
   .aggregate_rows(gene_count = agg.count(mt['Gene'])))

But I get:

AttributeError: GroupedMatrixTable instance has no field, method, or property 'aggregate_rows'
   Did you mean:
       GroupedMatrixTable method: 'aggregate'

And If I use aggregate only, I don’t get the count over mt.Gene, but rather the family member count:

dataset_result.gene_count.show()
Gene  |   locus  |   family  |  gene_count | 
| str | locus<GRCh37> | str  |  int64 |
| G1  |        L1    |  fam1 |  4   |

#34
1)
yy = xx.filter_rows(xx['Gene']=='A2M', keep=True)
yy.aggregate_rows(hl.agg.collect_as_set(yy['Gene']))
ans: {'A2M'}
2)
yy = xx.filter_rows(xx['Gene']=='A2M.*', keep=True)
yy.aggregate_rows(hl.agg.collect_as_set(yy['Gene']))
Ans: set()

== is the wrong method - you want

xx['Gene'].matches('A2M.*')

Regarding the aggregation: let’s try in steps

mt2 = mt.group_rows_by('Gene').aggregate(n_alt_alleles = hl.agg.sum(mt.GT.n_alt_alleles()))
# now mt2 is a gene-by-sample matrix with n_alt_alleles as the only entry field
mt3 = mt2.group_cols_by(mt2.family).aggregate(alt_alleles = hl.agg.sum(mt2.n_alt_alleles))
# now mt3 is a gene-by-family matrix of alt alleles

does mt2 satisfy the “the number of variants per gene, per family member?” that you want?


#35

you can’t currently aggregate in both dimensions at once – group_cols_by().aggregate() aggregates over columns, preserving the row fields / number of rows. group_rows_by().aggregate() aggregates over rows, preserving the columns exactly.


#36

I’ll do further verification but this worked! and by adding this small tweak (adding the role):

mt3 = mt2.group_cols_by('family','role').aggregate(alt_alleles = hl.agg.sum(mt2.n_alt_alleles))

I get exactly what I need. The key is to use the sum over mt.GT.n_alt_alleles(), and not the count.

BTW, this one also worked --> xx['Gene'].matches('A2M.*')

Thanks again!


#37

If m3.entry is a StructExpression, how can it be changed to a Table, to use Table exclusive functions?


#38

Grab the m3.entries() - that’s the table you want


#39

Hi Tim,
I have 4 questions for you, all related to the same task:
1)
I’m using the trio_matrix and the de_novo functions to get the de novo variants. I’ve been using trio pedigree files, which means I have to repeat the process n patient times. I have over 4K individuals and the procedure is the same for all of them. My initial matrix table has all the patients. I just apply a filter_cols function on a per family basis before obtaining the corresponding variants. Instead of executing a for loop with the data of each individual, is there is a different approach to make this task more efficient?
2)
Once I get my SNVs and after applying filters for de_novo selection, I do an mt_tm.count(), since I want to know the total SNV # and the de_novo variants per individual. However, I get the same count number before and after applying the filters (even in different variables). Which is the best procedure to get the right count?
3)
All my resulting tables (or matrix tables) from the steps above have the exact same keys. Is there is a way to do vertical stacking in Hail just like in the function pd.concat? I want to have a table with all the results for further analysis.
4)
Finally, I have an entry called score of type int32. Some of its elements are NAs. I need to do operations with that entry and I tried doing something like: mt=mt.annotate_entries(test=mt.score+1) expecting to have a 1 instead of NAs but the NAs do not change. I tried a few things with is_nan but I couldn’t find the right expression. What would do the trick in this case?

Thanks,

Carlos


#40
  1. If you concatenate all the trio files (in unix, maybe) and read in as one pedigree, this will be MUCH MUCH MUCH faster and let you do trio_matrix / de_novo once.

  2. are you filtering rows, cols, or entries? If you are filtering rows or cols, then the count should change. If you’re filtering entries, then the count of rows/cols will stay the same.

  3. If you do it all in parallel as in (1) then this is no longer necessary, but ht.union is like pd.concat.

  4. the functions you want are hl.is_defined and hl.is_missing. In Hail (and most language), a missing value is distinct from nan. You might also be interested in the function hl.coalesce which takes any number of arguments and returns the first non-missing – e.g. hl.coalesce(mt.score, 0) + 1 will return 1 for missing scores.