Import multiple plink files

jhchung · May 12, 2017, 7:24pm

Is there a way to import plink files that are split into separate chromosomes? I see that gen and bgen files have wildcard matching to import data, is there a similar feature for plink and/or vcf files?

Thanks,
-Jonathan

tpoterba · May 14, 2017, 3:15am

No way to do this at the moment, I think. We can add a VDS union function, which should solve this problem.

jjfarrell · October 1, 2017, 3:07pm

A union of VariantDatasets would be a useful feature for concatenating chromosomes. Is there any work being done on this yet?

John

tpoterba · October 1, 2017, 3:19pm

Definitely in the plans for 0.2! If we do it sooner, maybe we can even backport it to 0.1 as well.

Stephane_Bourgeois · October 6, 2017, 12:10pm

It would be great to have the possibility to make a union of multiple vds files which can have different samples, representing different cohorts, not only for concatenating chromosomes. Is that in the plans for 0.2 also?

Cheers,

Stephane

tpoterba · October 6, 2017, 1:22pm

We have VariantDataset.join, which joins datasets with different samples on common variants.

Stephane_Bourgeois · October 6, 2017, 3:26pm

That is what I was looking for, thanks. One more question, though: what happens when the variants are not on the same strand, or when there is an allele discrepancy?

tpoterba · October 7, 2017, 1:28am

We don’t currently have anything to correct strand flips or ref/alt swaps. The variant 1:100:A:T won’t join with 1:100:T:A.
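To illustrate the exact-match behaviour described above, here is a minimal plain-Python sketch. It does not use Hail and is not how `VariantDataset.join` is implemented; it only assumes that variants are keyed by (contig, position, ref, alt) and joined on identical keys. The name `inner_join_keys` and the example variants are made up for illustration.

```python
# Illustration only: variants keyed by (contig, position, ref, alt).
# Assumption: a join keeps a variant only when the full key is identical.

def inner_join_keys(left, right):
    """Return the variant keys present in both datasets, sorted."""
    return sorted(set(left) & set(right))

cohort_a = [("1", 100, "A", "T"), ("1", 200, "G", "C")]
cohort_b = [("1", 100, "T", "A"), ("1", 200, "G", "C")]

shared = inner_join_keys(cohort_a, cohort_b)
print(shared)
```

Only the variant at position 200 survives, because the two keys at position 100 differ in allele order even though they describe the same site.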

Stephane_Bourgeois · October 7, 2017, 9:58am

Is this functionality in the works for the next version?

tpoterba · October 7, 2017, 2:58pm

We’re expanding our reference functionality for 0.2 but don’t currently have any concrete plans to use fasta files (or similar) to actually do realignment.

Stephane_Bourgeois · October 9, 2017, 8:57am

Even without using FASTA, a simple option that attempts to flip A/C to T/G (for example) and checks whether that produces a match would be useful. Another option instructing join to treat A/C and C/A as a simple difference in minor allele between cohorts/populations would also increase functionality. Just an idea.
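A rough sketch of what such a check could look like, in plain Python rather than Hail. Everything here is hypothetical (the function names `complement`, `is_strand_ambiguous`, and `classify_alleles` are not part of any Hail API), and it assumes alleles are simple A/C/G/T strings compared at an already-matched locus.

```python
# Hypothetical helper, not part of Hail: compare the (ref, alt) pair of one
# dataset against another at the same locus, without a reference FASTA.

_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(allele):
    """Reverse complement of an allele (a single base for SNPs)."""
    return "".join(_COMPLEMENT[base] for base in reversed(allele))

def is_strand_ambiguous(ref, alt):
    """A/T and C/G pairs look the same on both strands."""
    return complement(ref) == alt

def classify_alleles(a, b):
    """Describe how pair b = (ref, alt) relates to pair a = (ref, alt)."""
    b_flipped = (complement(b[0]), complement(b[1]))
    if b == a:
        return "match"
    if b[::-1] == a:
        return "swap"        # ref/alt swapped, e.g. A/C vs C/A
    if b_flipped == a:
        return "flip"        # opposite strand, e.g. A/C vs T/G
    if b_flipped[::-1] == a:
        return "flip_swap"   # opposite strand and swapped, e.g. A/C vs G/T
    return "mismatch"

print(classify_alleles(("A", "C"), ("T", "G")))
print(classify_alleles(("A", "C"), ("C", "A")))
print(is_strand_ambiguous("A", "T"))
```

Note that for A/T and C/G variants a swap and a strand flip cannot be told apart from the alleles alone, which is one reason an automatic fix may be wrong for some datasets.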

tpoterba · October 9, 2017, 10:26am

All good points. Some of this stuff is tough to do automatically (it may be the wrong thing for some datasets) but we should certainly have general functionality that extends join to do a unification step. This is a good motivating example!

Stephane_Bourgeois · October 9, 2017, 11:21am

I agree this would not be suitable for all datasets, no question there. But a way to detect/flip strands, like in Plink, and to disregard allele frequency when merging would alleviate the need to go back to Plink to merge datasets. Disregarding allele frequency matters when merging populations in which the AF is different (the multi-ethnic HapMap 3 panel, for example), or for variants with a MAF very close to 50%.
I’m sure actually implementing it may not be as straightforward as it sounds, so maybe for v0.3?
