Lower casing VDS schema key names


#1

Hi,

I’m using Amazon Athena to query the VDS files generated by Hail. However, Athena lowercases all column names when they are queried and with VDS column names having upper cased letters the queries fail. Is there a way to iterate through all the schema keys in a VDS and lower case them ?

Thanks
Roy


#2

Hi Roy,
We don’t have any functions to do this for you right now, but I agree it would be useful to add!

It’s possible to do this in Python, though it looks a bit ugly. It’s also going to be slow-ish, since the implementation is a bit mangy, but it works!

def lower_va_schema(vds):

    # the below function is called on structs to recursively generate the expr from lower case names
    def generate_struct_expr(schema, prefix):
        assert isinstance(schema, TStruct)

        exprs = [] 
        for field in schema.fields:
            name = field.name
            typ = field.typ
            full_name = prefix + '.`{}`'.format(name)
            lower = name.lower()
            if isinstance(typ, TStruct):
                right_hand = generate_struct_expr(typ, full_name)
            else:
                right_hand = full_name
            exprs.append('`{}`: {}'.format(lower, right_hand))
        return '{' + ','.join(exprs) + '}'


    va_expr = generate_struct_expr(vds.variant_schema, 'va')
    return vds.annotate_variants_expr('va = {}'.format(va_expr))

Here’s a test:

Pre-lowering schema:

In [21]: pprint(vds.variant_schema)
Struct{
     rsid: String,
     qual: Double,
     filters: Set[String],
     info: Struct{
         NEGATIVE_TRAIN_SITE: Boolean,
         HWP: Double,
         AC: Array[Int],
         culprit: String,
         MQ0: Int,
         ReadPosRankSum: Double,
         AN: Int,
         InbreedingCoeff: Double,
         AF: Array[Double],
         GQ_STDDEV: Double,
         FS: Double,
         DP: Int,
         GQ_MEAN: Double,
         POSITIVE_TRAIN_SITE: Boolean,
         VQSLOD: Double,
         ClippingRankSum: Double,
         BaseQRankSum: Double,
         MLEAF: Array[Double],
         MLEAC: Array[Int],
         MQ: Double,
         QD: Double,
         END: Int,
         DB: Boolean,
         HaplotypeScore: Double,
         MQRankSum: Double,
         CCC: Int,
         NCC: Int,
         DS: Boolean
     },
     FOO: Struct{
         BAR: Struct{
             BAZ: Int
         }
     },
     f: Struct{
         F: Struct{
             f: Struct{
                 F: Int
             }
         }
     }
 }

Now calling the function:


In [22]: pprint(lower_va_schema(vds).variant_schema)
Struct{
     rsid: String,
     qual: Double,
     filters: Set[String],
     info: Struct{
         negative_train_site: Boolean,
         hwp: Double,
         ac: Array[Int],
         culprit: String,
         mq0: Int,
         readposranksum: Double,
         an: Int,
         inbreedingcoeff: Double,
         af: Array[Double],
         gq_stddev: Double,
         fs: Double,
         dp: Int,
         gq_mean: Double,
         positive_train_site: Boolean,
         vqslod: Double,
         clippingranksum: Double,
         baseqranksum: Double,
         mleaf: Array[Double],
         mleac: Array[Int],
         mq: Double,
         qd: Double,
         end: Int,
         db: Boolean,
         haplotypescore: Double,
         mqranksum: Double,
         ccc: Int,
         ncc: Int,
         ds: Boolean
     },
     foo: Struct{
         bar: Struct{
             baz: Int
         }
     },
     f: Struct{
         f: Struct{
             f: Struct{
                 f: Int
             }
         }
     }
 }

#3

Also note that this will overwrite stuff if you have a schema with duplicate lowered elements like va.foo and va.FOO.


#4

Thanks Tim, that worked perfectly. I took your code sample and applied it to sample schema as well.

As a side note, I’m writing out the VDS to S3, when I query it via Athena, I’m not seeing any of the sample annotation columns such as sex, population code etc, but it is present in the sample_schema. Any idea why ?

If you are interested I’ll be happy to share with you a draft of my blog on running Hail on Amazon EMR and querying VDS with Amazon Athena.

Thanks
Roy


#5

Hi Roy,
While the 0.1 stable build uses parquet to store the variants and genotypes, we store the samples and sample metadata in a json file (vds/metadata.json.gz)

If you want to toss samples / sample annotations in parquet, you can do that with Spark dataframes:


vds.samples_table().to_dataframe().write.parquet('out.parquet')

My syntax here could be slightly off.


#6

Got it. Thanks for the clarification. I’ll do that.

Roy


#7

I think it’s probably a good idea to do the same with the variant data, too – we’re changing the on-disk representation away from parquet in 0.2 and peeking inside a VDS with Athena will no longer work.


#8

Thank you for the heads up. I’ll follow your guidance.

Roy


#9

By the way – we’d love to see the blog post!


#10

Found @rhasson 's blog post (good stuff there!):


#11

@hail-team Although we won’t use it as the default format or call it a VDS, I intend to have first-class methods for importing/exporting genotype data and associated annotation data as Parquet in 0.2. I’ll add it to the 0.2 beta checklist: http://dev.hail.is/t/0-2-beta-checklist/25.