Spark Tweak - to_pandas function

My samples’ names look similar to this:

family1.father
family1.mother
family1.child
family1.sibling1
family1.sibling2

Therefore, I can find variants across families, and also compare across specific family members or within a single family.

Your example above is a good guideline. Thanks!

We also have some rudimentary trio-based methods:

trio_matrix
mendel_errors
de_novo

https://hail.is/docs/devel/methods/genetics.html
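
A rough sketch of how these fit together (the .fam path and the GT / AF field names below are assumptions for illustration):

pedigree = hl.Pedigree.read('families.fam')                # PLINK-style .fam file (assumed path)
tm = hl.trio_matrix(mt, pedigree, complete_trios=True)     # one column per complete trio
all_errors, per_fam, per_sample, per_variant = hl.mendel_errors(mt['GT'], pedigree)
dn = hl.de_novo(mt, pedigree, pop_frequency_prior=mt.AF)   # AF assumed to be a float64 row field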

I think this is what I needed:

split = ht.s.split("_")  # the delimiter must match the sample naming; for names like 'family1.father', use ht.s.split("\\.")
ht = ht.annotate(family = split[0],
                 role = split[1])
per_family_ht = ht.group_by('family')\
    .aggregate(mean_pheno_1 = hl.agg.mean(ht.pheno_1))

Thanks again,

Carlos

Hi Tim,

I have a couple of questions:

  1. What would be the cleanest way to split a dictionary generated by agg.counter (w/o using a loop, maybe?). I want to use the output to annotate a MatrixTable.

  2. I want to execute the de_novo function but I need to generate my .fam files first. Is there a way to generate a single table (maybe a merge) from the outputs of my individual family de_novo output tables? My data was generated by GATK, so I shouldn't have an issue executing the function.

Thanks!

Split how? If you want to use that result in expressions, you can do something like:

c = mt.aggregate_rows(hl.agg.counter(...))
c = hl.literal(c)  # converts to expression
mt = mt.annotate_rows(bar = c[mt.foo] + 5)
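
For a concrete (hypothetical) version, counting rows per value of a row field Gene and writing the count back onto the rows:

gene_counts = mt.aggregate_rows(hl.agg.counter(mt.Gene))   # a plain Python dict: {gene: n_rows}
gene_counts = hl.literal(gene_counts)                      # dict -> Hail DictExpression
mt = mt.annotate_rows(n_variants_in_gene = gene_counts[mt.Gene])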

I don’t totally understand what you’re asking, sorry. Is the problem the generation of the pedigree files (you could also construct those in Python if that’s easier) or the usage of the results of de_novo?

Since the de_novo function returns a table, and supposing I execute it on 10 trios, how can I merge/bind those 10 tables into a single one?

Do you have one MatrixTable per trio? If so, you can merge the resulting per-trio tables with ht.union. Otherwise, a single de_novo call on the full MatrixTable (with the full pedigree) will call de novo events for every trio at once.
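
If you do end up with one result table per trio, the merge is a one-liner; a sketch, assuming de_novo_tables is a Python list of those tables:

# de_novo_tables: a Python list of per-trio de_novo result Tables (assumed)
merged = de_novo_tables[0].union(*de_novo_tables[1:])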

Hi Tim,

Due to the nature of my annotation file, I can only annotate over rows using the locus as the key. If I set my row keys to Gene and Locus and my column keys to family and role, how can I do 1) an aggregation to count the number of variant calls that a single family has per gene, and 2) count the number of variants per gene, per family member?

Also, is there a way to apply a filter (filter_cols) using a baseline pattern? For instance, I have the following in my set: A2M, A2M-AS1, A2ML1, and I've tried many regexes, but only an exact match works.

Thanks!

is there a way to apply a filter (filter_cols) using a baseline pattern

Yes, Java regex syntax should work. If there's a specific case we can look at where things seem to be misbehaving, that would be helpful.
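
For example, with sample names like the ones above, something along these lines should keep only family1's columns (assuming your MatrixTable is mt and sample IDs are in the default column field s; the pattern itself is an assumption):

mt_fam1 = mt.filter_cols(mt.s.matches('family1\\..*'))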

How can I do 1) an aggregation to count the number of variant calls that a single family has per gene and 2) the number of variants per gene, per family member?

Have you managed to get a pedigree constructed? Using the trio_matrix function will make this super easy. In that case, you’ll want to do something like:

tm = hl.trio_matrix(mt, pedigree)
tm = tm.group_rows_by(tm.Gene).aggregate(
    n_calls_proband = hl.agg.sum(tm.proband_entry.GT.n_alt_alleles()),
    n_calls_mother = hl.agg.sum(tm.mother_entry.GT.n_alt_alleles()),
    n_calls_father = hl.agg.sum(tm.father_entry.GT.n_alt_alleles()))

Now you have all that information, I think (the per-family count is just the sum of those three values).
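
For the per-family total, a small sketch that just sums the three new entry fields:

tm = tm.annotate_entries(n_calls_family = tm.n_calls_proband + tm.n_calls_mother + tm.n_calls_father)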

Since I have 2,000+ families, I was hoping there was something like mt.group_rows_by('Gene').aggregate(fam_count=agg.count()) or mt.group_cols_by('family').aggregate(gene_count=agg.count()) (the latter is the closest I've gotten to my desired result) in order to get a summarized matrix. Let me generate the pedigree files and explore the trio_matrix option.
Thanks again!

You can do this programmatically in Python, too. You can definitely do what you've just written, as well. I think there might be one step that is missing (this PR adds it): the ability to aggregate row/col fields as well as entry fields.

This works in Python:

import re

genes = ['AM', 'AMG', 'AM1', 'AM-1']   # renamed from 'set', which shadows the builtin
r = re.compile("AM.*")
print(list(filter(r.match, genes)))
# ['AM', 'AMG', 'AM1', 'AM-1']

And this is what I do in Hail:

1)
yy = xx.filter_rows(xx['Gene'] == 'A2M', keep=True)
yy.aggregate_rows(hl.agg.collect_as_set(yy['Gene']))
# {'A2M'}

2)
yy = xx.filter_rows(xx['Gene'] == 'A2M.*', keep=True)
yy.aggregate_rows(hl.agg.collect_as_set(yy['Gene']))
# set()

From:

You can do this programmatically in Python, too. You can definitely do what you've just written, as well. I think there might be one step that is missing (this PR adds it):

I try to do:

dataset_result = (mt.group_cols_by('family')
   .aggregate_rows(gene_count = agg.count(mt['Gene'])))

But I get:

AttributeError: GroupedMatrixTable instance has no field, method, or property 'aggregate_rows'
   Did you mean:
       GroupedMatrixTable method: 'aggregate'

And if I use aggregate only, I don't get the count over mt.Gene, but rather the family-member count:

dataset_result.gene_count.show()

Gene | locus         | family | gene_count
str  | locus<GRCh37> | str    | int64
-----+---------------+--------+-----------
G1   | L1            | fam1   | 4

1)
yy = xx.filter_rows(xx['Gene'] == 'A2M', keep=True)
yy.aggregate_rows(hl.agg.collect_as_set(yy['Gene']))
# {'A2M'}

2)
yy = xx.filter_rows(xx['Gene'] == 'A2M.*', keep=True)
yy.aggregate_rows(hl.agg.collect_as_set(yy['Gene']))
# set()

== is the wrong method - you want

xx['Gene'].matches('A2M.*')
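
Plugged into the earlier filter, that would look something like:

yy = xx.filter_rows(xx['Gene'].matches('A2M.*'), keep=True)
yy.aggregate_rows(hl.agg.collect_as_set(yy['Gene']))   # should now include A2M, A2M-AS1, A2ML1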

Regarding the aggregation: let's try it in steps:

mt2 = mt.group_rows_by('Gene').aggregate(n_alt_alleles = hl.agg.sum(mt.GT.n_alt_alleles()))
# now mt2 is a gene-by-sample matrix with n_alt_alleles as the only entry field
mt3 = mt2.group_cols_by(mt2.family).aggregate(alt_alleles = hl.agg.sum(mt2.n_alt_alleles))
# now mt3 is a gene-by-family matrix of alt alleles

Does mt2 satisfy the "number of variants per gene, per family member" that you want?

You can't currently aggregate in both dimensions at once: group_cols_by().aggregate() aggregates over columns, preserving the row fields / number of rows, while group_rows_by().aggregate() aggregates over rows, preserving the columns exactly.

I'll do further verification, but this worked! And by adding this small tweak (adding the role):

mt3 = mt2.group_cols_by('family', 'role').aggregate(alt_alleles = hl.agg.sum(mt2.n_alt_alleles))

I get exactly what I need. The key is to use the sum over mt.GT.n_alt_alleles(), and not the count.

BTW, this one also worked --> xx['Gene'].matches('A2M.*')

Thanks again!

If mt3.entry is a StructExpression, how can it be changed to a Table, to use Table-exclusive functions?

Grab mt3.entries() - that's the Table you want.
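
A minimal sketch, assuming the mt3 from above (to_pandas is just one example of a Table-only method):

et = mt3.entries()      # a Table with one row per (row key, column key) pair, plus row/col/entry fields
df = et.to_pandas()     # Table-only methods like to_pandas are now available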

Hi Tim,
I have 4 questions for you, all related to the same task:
1)
I'm using the trio_matrix and de_novo functions to get the de novo variants. I've been using per-trio pedigree files, which means I have to repeat the process once per patient. I have over 4K individuals and the procedure is the same for all of them. My initial MatrixTable has all the patients; I just apply filter_cols on a per-family basis before obtaining the corresponding variants. Instead of running a for loop over each individual's data, is there a different approach to make this task more efficient?
2)
Once I get my SNVs and after applying filters for de novo selection, I do an mt_tm.count(), since I want to know the total SNV count and the number of de novo variants per individual. However, I get the same count before and after applying the filters (even when stored in different variables). What is the best procedure to get the right count?
3)
All my resulting tables (or matrix tables) from the steps above have exactly the same keys. Is there a way to do vertical stacking in Hail, like pd.concat? I want a single table with all the results for further analysis.
4)
Finally, I have an entry called score of type int32. Some of its elements are missing (NA). I need to do operations with that entry, and I tried something like mt = mt.annotate_entries(test = mt.score + 1), expecting to get 1 instead of NA, but the NAs do not change. I tried a few things with is_nan but couldn't find the right expression. What would do the trick in this case?

Thanks,

Carlos

  1. If you concatenate all the trio .fam files (in unix, maybe) and read them in as one pedigree, this will be MUCH MUCH MUCH faster and let you do trio_matrix / de_novo once (see the sketch after this list).

  2. Are you filtering rows, cols, or entries? If you are filtering rows or cols, then the count should change. If you're filtering entries, then the count of rows/cols will stay the same.

  3. If you do it all in parallel as in (1), then this is no longer necessary, but ht.union is like pd.concat.

  4. The functions you want are hl.is_defined and hl.is_missing. In Hail (and most languages), a missing value is distinct from NaN. You might also be interested in hl.coalesce, which takes any number of arguments and returns the first non-missing one; e.g. hl.coalesce(mt.score, 0) + 1 will return 1 for missing scores.
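
A minimal sketch of (1) and (4) together; the combined pedigree path 'all_trios.fam' and the AF row field used as the frequency prior are assumptions:

pedigree = hl.Pedigree.read('all_trios.fam')                      # all trios in one pedigree file (assumed path)
mt = mt.annotate_entries(score = hl.coalesce(mt.score, 0))        # (4): missing scores become 0
de_novo_ht = hl.de_novo(mt, pedigree, pop_frequency_prior=mt.AF)  # (1): one call covers every trio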