Lower casing VDS schema key names

rhasson · September 26, 2017, 12:51pm

Hi,

I’m using Amazon Athena to query the VDS files generated by Hail. However, Athena lowercases all column names when they are queried and with VDS column names having upper cased letters the queries fail. Is there a way to iterate through all the schema keys in a VDS and lower case them ?

Thanks
Roy

tpoterba · September 26, 2017, 1:20pm

Hi Roy,
We don’t have any functions to do this for you right now, but I agree it would be useful to add!

It’s possible to do this in Python, though it looks a bit ugly. It’s also going to be slow-ish, since the implementation is a bit mangy, but it works!

def lower_va_schema(vds):

    # the below function is called on structs to recursively generate the expr from lower case names
    def generate_struct_expr(schema, prefix):
        assert isinstance(schema, TStruct)

        exprs = [] 
        for field in schema.fields:
            name = field.name
            typ = field.typ
            full_name = prefix + '.`{}`'.format(name)
            lower = name.lower()
            if isinstance(typ, TStruct):
                right_hand = generate_struct_expr(typ, full_name)
            else:
                right_hand = full_name
            exprs.append('`{}`: {}'.format(lower, right_hand))
        return '{' + ','.join(exprs) + '}'


    va_expr = generate_struct_expr(vds.variant_schema, 'va')
    return vds.annotate_variants_expr('va = {}'.format(va_expr))

Here’s a test:

Pre-lowering schema:

In [21]: pprint(vds.variant_schema)
Struct{
     rsid: String,
     qual: Double,
     filters: Set[String],
     info: Struct{
         NEGATIVE_TRAIN_SITE: Boolean,
         HWP: Double,
         AC: Array[Int],
         culprit: String,
         MQ0: Int,
         ReadPosRankSum: Double,
         AN: Int,
         InbreedingCoeff: Double,
         AF: Array[Double],
         GQ_STDDEV: Double,
         FS: Double,
         DP: Int,
         GQ_MEAN: Double,
         POSITIVE_TRAIN_SITE: Boolean,
         VQSLOD: Double,
         ClippingRankSum: Double,
         BaseQRankSum: Double,
         MLEAF: Array[Double],
         MLEAC: Array[Int],
         MQ: Double,
         QD: Double,
         END: Int,
         DB: Boolean,
         HaplotypeScore: Double,
         MQRankSum: Double,
         CCC: Int,
         NCC: Int,
         DS: Boolean
     },
     FOO: Struct{
         BAR: Struct{
             BAZ: Int
         }
     },
     f: Struct{
         F: Struct{
             f: Struct{
                 F: Int
             }
         }
     }
 }

Now calling the function:


In [22]: pprint(lower_va_schema(vds).variant_schema)
Struct{
     rsid: String,
     qual: Double,
     filters: Set[String],
     info: Struct{
         negative_train_site: Boolean,
         hwp: Double,
         ac: Array[Int],
         culprit: String,
         mq0: Int,
         readposranksum: Double,
         an: Int,
         inbreedingcoeff: Double,
         af: Array[Double],
         gq_stddev: Double,
         fs: Double,
         dp: Int,
         gq_mean: Double,
         positive_train_site: Boolean,
         vqslod: Double,
         clippingranksum: Double,
         baseqranksum: Double,
         mleaf: Array[Double],
         mleac: Array[Int],
         mq: Double,
         qd: Double,
         end: Int,
         db: Boolean,
         haplotypescore: Double,
         mqranksum: Double,
         ccc: Int,
         ncc: Int,
         ds: Boolean
     },
     foo: Struct{
         bar: Struct{
             baz: Int
         }
     },
     f: Struct{
         f: Struct{
             f: Struct{
                 f: Int
             }
         }
     }
 }

tpoterba · September 26, 2017, 1:21pm

Also note that this will overwrite stuff if you have a schema with duplicate lowered elements like va.foo and va.FOO.

rhasson · September 26, 2017, 1:59pm

Thanks Tim, that worked perfectly. I took your code sample and applied it to sample schema as well.

As a side note, I’m writing out the VDS to S3, when I query it via Athena, I’m not seeing any of the sample annotation columns such as sex, population code etc, but it is present in the sample_schema. Any idea why ?

If you are interested I’ll be happy to share with you a draft of my blog on running Hail on Amazon EMR and querying VDS with Amazon Athena.

Thanks
Roy

tpoterba · September 26, 2017, 2:15pm

Hi Roy,
While the 0.1 stable build uses parquet to store the variants and genotypes, we store the samples and sample metadata in a json file (vds/metadata.json.gz)

If you want to toss samples / sample annotations in parquet, you can do that with Spark dataframes:


vds.samples_table().to_dataframe().write.parquet('out.parquet')

My syntax here could be slightly off.

rhasson · September 26, 2017, 2:17pm

Got it. Thanks for the clarification. I’ll do that.

Roy

hail-team · September 26, 2017, 2:55pm

I think it’s probably a good idea to do the same with the variant data, too – we’re changing the on-disk representation away from parquet in 0.2 and peeking inside a VDS with Athena will no longer work.

rhasson · September 26, 2017, 7:23pm

Thank you for the heads up. I’ll follow your guidance.

Roy

tpoterba · October 3, 2017, 6:33pm

By the way – we’d love to see the blog post!

Oron_Navon · November 29, 2017, 8:27am

Found @rhasson 's blog post (good stuff there!):

cseed · December 1, 2017, 11:58pm

@hail-team Although we won’t use it as the default format or call it a VDS, I intend to have first-class methods for importing/exporting genotype data and associated annotation data as Parquet in 0.2. I’ll add it to the 0.2 beta checklist: http://dev.hail.is/t/0-2-beta-checklist/25.

Topic		Replies	Views
New Python features; print_schema and show_globals removed Updates	0	778	January 28, 2017
Create project VCFs from VDS Hail Query & hailctl	6	460	November 29, 2022
Error reading filename with brackets (DNAnexus) Hail Query & hailctl	0	190	January 18, 2024
Removed import_annotations_table, added VariantDataset.from_keytable Updates	0	838	March 21, 2017
VEP- vep_json_schema specification for colocated_variants.frequencies Hail Query & hailctl	3	472	April 14, 2022

Lower casing VDS schema key names

Related topics