Statement from Human Longevity Inc. on PNAS Paper on Face and other Trait Predictions from Whole Genome Sequencing and Machine Learning for Identification of Individuals
September 5, 2017
As outlined in a peer-reviewed study recently published in the journal Proceedings of the National Academy of Sciences (PNAS), Human Longevity researchers set out to see which traits could be predicted, and thus used to identify individuals, by applying machine learning algorithms to whole genome data, without assuming that any additional information such as age, sex, or ethnicity is shared alongside the genome. The study centered on 1,061 people from diverse ethno-geographic backgrounds. The authors readily acknowledged that this was a very small cohort and that much larger cohorts would be needed to make more precise predictions and identifications. The researchers stand behind their methodology (the paper’s supplemental material devotes more than 40 pages to outlining all methods) and invite all to review the PNAS paper.
As the team states in the conclusion of the paper:
We have presented predictive models for facial structure, voice, eye color, skin color, height, weight, and BMI from common genetic variation and have developed a model for estimating age from WGS data. Despite limitations in statistical power due to the small sample size of 1,061 individuals, predictions are sound. Although individually, each predictive model provided limited information about an individual’s identity, we have derived an optimal similarity measure from multiple prediction models that enabled matching between genomes and phenotypic profiles with good accuracy. Over time, predictions will get more precise, and, thus, the results of this work will be of greater consideration in the current discussion on genome privacy protection.
As also stated in the paper and publicly, a central reason for doing this study was to point out that as ever larger databases of genomic and associated phenotypic information come into existence (both public and private), individuals who participate in these studies need to fully understand the implications of having their genomes in such databases. Those in our field and policymakers must also understand this situation. A core belief of the HLI researchers is that there is now no such thing as true de-identification and full privacy in publicly accessible databases, because one’s genome is the ultimate identifier: it codes for all the physical traits by which that individual is recognized. Put simply, given a genome from the public domain, researchers can sketch a picture of the individual and thereby identify that person. And while current methodologies are still limited in sophistication, the field is rapidly advancing, so methodologies will only improve.
We agree that sharing of genomic data is invaluable for research; however, to reiterate, our results suggest that genomes cannot be considered fully de-identifiable and should be shared using appropriate levels of security and due diligence. At HLI we employ some of the best minds and tools to ensure the security of our data. We look forward to continuing to work with interested researchers, policymakers, and legislators to ensure the safety and privacy of genomic and other health-related information.