How can I remove related individuals from a vcf file using Hail?
Take a look here: https://hail.is/docs/stable/hail.KeyTable.html#hail.KeyTable.maximal_independent_set
Note that pc_relate has some scaling issues, so if your data are large, the given pipeline (pc_relate followed by maximal_independent_set) may struggle to finish.
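For reference, here is a minimal sketch of that pipeline against the Hail 0.1 API in the linked docs. The input filename, the PC-Relate parameters (2 PCs, 1% minimum allele frequency), and the kin > 0.125 cutoff (roughly second-degree relatives) are illustrative assumptions, so double-check the exact signatures against the docs:

from hail import HailContext

hc = HailContext()
vds = hc.import_vcf('mydata.vcf')  # hypothetical input file

# Estimate pairwise kinship with PC-Relate, then keep only pairs more
# related than the (illustrative) second-degree cutoff.
related_pairs = vds.pc_relate(2, 0.01).filter('kin > 0.125')

# maximal_independent_set returns the largest subset of the related
# samples that breaks every related pair; the related samples outside
# that subset are the ones to drop.
to_keep = set(related_pairs.maximal_independent_set('i', 'j'))
all_related = set(related_pairs.query('i.collect()')) | \
              set(related_pairs.query('j.collect()'))
to_remove = all_related - to_keep

vds_unrelated = vds.filter_samples_list(list(to_remove), keep=False)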
Is there complete documentation on how to use Hail?
Something like a PDF?
Yes, the documentation is here: https://hail.is/docs/stable/index.html
Hail is a maturing tool, and we would certainly benefit from having a pipeline cookbook. But every method is documented there, and most have examples!
I did see this, but is there a downloadable version?
Yes! If you download the distributions from the Getting Started page, the documentation is built into those folders.
The hc.import_vcf function was working for me before, and now I cannot read the file:
In [38]: vds = hc.import_vcf('filtered.vcf')

2017-11-16 22:14:00 Hail: WARN: `filtered.vcf' refers to no files

FatalError                                Traceback (most recent call last)
<ipython-input-38-...> in <module>()
----> 1 vds = hc.import_vcf('filtered.vcf')

in import_vcf(self, path, force, force_bgz, header_file, min_partitions, drop_samples, store_gq, pp_as_pl, skip_bad_ad, generic, call_fields)

/Users/AleRodriguez/hail/python/hail/java.pyc in handle_py4j(func, *args, **kwargs)
    119             raise FatalError('%s\n\nJava stack trace:\n%s\n'
    120                              'Hail version: %s\n'
--> 121                              'Error summary: %s' % (deepest, full, Env.hc().version, deepest))
    122     except py4j.protocol.Py4JError as e:
    123         if e.args[0].startswith('An error occurred while calling'):
FatalError: HailException: arguments refer to no files
Java stack trace:
is.hail.utils.HailException: arguments refer to no files
at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:6)
at is.hail.utils.package$.fatal(package.scala:27)
at is.hail.io.vcf.LoadVCF$.globAllVCFs(LoadVCF.scala:105)
at is.hail.HailContext.importVCFs(HailContext.scala:509)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Hail version: 0.1-0ab38b4
Error summary: HailException: arguments refer to no files
Hey @alerodriguez, it looks like filtered.vcf doesn't exist in your current directory or you do not have permission to read it. If you execute the ls command from the same directory where you launched ipython, is filtered.vcf listed?
For example, this is what it should look like:
dking@wmb16-359 # pwd
/Users/dking/projects/hail-data/foo
dking@wmb16-359 # ls -al
total 2535176
drwxr-xr-x 3 dking CHARLES\Domain Users 102 Nov 17 10:59 .
drwxr-xr-x 140 dking CHARLES\Domain Users 4760 Nov 17 10:58 ..
-rw-r--r-- 1 dking CHARLES\Domain Users 1298009787 Nov 17 10:59 filtered.vcf
Note in particular that:
- filtered.vcf is listed, and
- the permissions for filtered.vcf start with -rw- and the owner is listed as dking (myself).
Both of these must be true for any program to be able to read filtered.vcf.
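If you would rather check from Python itself, here is a quick standard-library sanity check; run it from the same directory where you launched ipython (the filename is the one from this thread):

import os

print(os.path.exists('filtered.vcf'))      # True if the file is there
print(os.access('filtered.vcf', os.R_OK))  # True if this user can read it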
And one more thing! If you're running on a cluster that has HDFS installed, filenames by default refer to files in HDFS instead of files in the local file system. If this is your situation, then you should move filtered.vcf into HDFS, then load it from there in Hail.
On a cluster, Hail generally cannot load a file from a local file system, precisely because that file system is local to the particular machine to which you connected. When Hail is running on a cluster, every machine must be able to read the input files, and HDFS is exactly a system for letting every machine in the cluster read them.
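For example, here is a sketch of that workflow; the HDFS destination path is hypothetical, so adjust it for your cluster:

# First, from a shell on the cluster, copy the local file into HDFS:
#   hadoop fs -put filtered.vcf /user/alerodriguez/filtered.vcf
# Then, in the same Python session as before, import it with an HDFS URI:
vds = hc.import_vcf('hdfs:///user/alerodriguez/filtered.vcf')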