Remove related individuals from a VCF file

How can I remove related individuals from a VCF file using Hail?

Take a look here: https://hail.is/docs/stable/hail.KeyTable.html#hail.KeyTable.maximal_independent_set

Note that pc_relate has some scaling issues, so if your data are large, the given pipeline (pc_relate followed by maximal_independent_set) may struggle to finish.
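
For concreteness, here is a minimal sketch of that pipeline against the Hail 0.1 Python API (the version in this thread). The kinship cutoff, the pc_relate arguments, and the file paths are illustrative assumptions, not recommendations, and the exact return types are from my recollection of the 0.1 docs:

from hail import HailContext

hc = HailContext()
vds = hc.import_vcf('filtered.vcf')  # illustrative path

# Estimate pairwise kinship (2 PCs, MAF cutoff 0.001 -- illustrative values)
# and keep only the closely related pairs.
related_pairs = vds.pc_relate(2, 0.001).filter('kin > 0.125')

# maximal_independent_set returns the largest set of samples that can be
# kept such that no related pair remains among them.
to_keep = related_pairs.maximal_independent_set('i', 'j')

# Drop every sample that appears in a related pair but not in that set.
pairs = related_pairs.collect()
related_samples = set([p.i for p in pairs]) | set([p.j for p in pairs])
to_remove = related_samples - set(to_keep)

unrelated = vds.filter_samples_list(list(to_remove), keep=False)
unrelated.export_vcf('unrelated.vcf.bgz')

The point of the maximal independent set is that it keeps as many samples as possible while still breaking every related pair.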

Is there a complete documentation on how to use Hail?
a sort of pdf

Yes, the documentation is here: https://hail.is/docs/stable/index.html

Hail is a maturing tool, and we would certainly benefit from having a pipeline cookbook. But every method is documented there, and most have examples!

I did see this, but is there a downloadable version?

Yes, if you download the distributions from the Getting Started page, the documentation is included inside those folders!

The hc.import_vcf function was working for me before, but now I cannot read the file:

In [38]: vds = hc.import_vcf('filtered.vcf')

2017-11-16 22:14:00 Hail: WARN: `filtered.vcf' refers to no files

FatalError                                Traceback (most recent call last)
in <module>()
----> 1 vds = hc.import_vcf('filtered.vcf')
      2
      3
      4

in import_vcf(self, path, force, force_bgz, header_file, min_partitions, drop_samples, store_gq, pp_as_pl, skip_bad_ad, generic, call_fields)

/Users/AleRodriguez/hail/python/hail/java.pyc in handle_py4j(func, *args, **kwargs)
    119             raise FatalError('%s\n\nJava stack trace:\n%s\n'
    120                              'Hail version: %s\n'
--> 121                              'Error summary: %s' % (deepest, full, Env.hc().version, deepest))
    122         except py4j.protocol.Py4JError as e:
    123             if e.args[0].startswith('An error occurred while calling'):

FatalError: HailException: arguments refer to no files

Java stack trace:
is.hail.utils.HailException: arguments refer to no files
at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:6)
at is.hail.utils.package$.fatal(package.scala:27)
at is.hail.io.vcf.LoadVCF$.globAllVCFs(LoadVCF.scala:105)
at is.hail.HailContext.importVCFs(HailContext.scala:509)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

Hail version: 0.1-0ab38b4
Error summary: HailException: arguments refer to no files

Hey @alerodriguez, it looks like filtered.vcf doesn’t exist in your current directory or you do not have permission to read it. If you execute the ls command from the same directory where you launched ipython, is filtered.vcf listed?

For example, this is what it should look like:

dking@wmb16-359 # pwd
/Users/dking/projects/hail-data/foo
dking@wmb16-359 # ls -al
total 2535176
drwxr-xr-x    3 dking  CHARLES\Domain Users         102 Nov 17 10:59 .
drwxr-xr-x  140 dking  CHARLES\Domain Users        4760 Nov 17 10:58 ..
-rw-r--r--    1 dking  CHARLES\Domain Users  1298009787 Nov 17 10:59 filtered.vcf

Note in particular that:

  • filtered.vcf is listed, and
  • the permissions for filtered.vcf start with -rw- and the owner is listed as dking (myself). Both of these must be true for any program to be able to read filtered.vcf.
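
If you'd rather check from inside the same Python session, the standard library can answer both questions (is the file visible, and is it readable):

import os

print(os.getcwd())                          # the directory ipython was launched from
print(os.path.exists('filtered.vcf'))       # is the file visible from there?
print(os.access('filtered.vcf', os.R_OK))   # do you have permission to read it?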

And one more thing! If you’re running on a cluster that has HDFS installed, filenames by default refer to files in HDFS rather than in the local file system. If that is your situation, you should move filtered.vcf into HDFS (e.g. with hadoop fs -put) and then load it from there in Hail.

On a cluster, Hail generally cannot load a file from a local file system, precisely because that file system is local to the particular machine you connected to. When Hail runs on a cluster, every machine must be able to read the files, and HDFS is exactly a system for letting every machine in the cluster do so.
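
For example, assuming a hypothetical HDFS home directory of /user/alerodriguez, you could copy the file in with hadoop fs -put filtered.vcf /user/alerodriguez/ and then be explicit about which file system a path refers to:

# A bare path resolves against the default file system -- HDFS on a cluster.
vds = hc.import_vcf('hdfs:///user/alerodriguez/filtered.vcf')  # hypothetical path

# A file:// URI forces the node-local file system. On a multi-node cluster this
# generally fails, because worker nodes cannot see the driver's local disk.
# vds = hc.import_vcf('file:///Users/AleRodriguez/filtered.vcf')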