In just a few years, public DNA databases have become powerful tools for solving genetic mysteries. They have been used to find long-lost relatives and to help adoptees locate their biological parents. Their best-known use is helping police solve cold cases, from identifying anonymous bodies to tracking down criminal suspects. The Golden State Killer, for example, was identified using a DNA sample from nearly 40 years ago that linked his genetic profile to distant cousins’ profiles posted on GEDMatch.

These breakthroughs, while proving the promise of genetic genealogy research, have also raised privacy concerns, not just for those who have shared their genome profiles but also for millions of others who have never taken a test.

GEDMatch and MyHeritage, two popular DNA sites, have 1.4 million and 1.3 million members, respectively. However, the number of people who could be identified through these databases is much higher.

In theory, two genomes can be related if they have shared ancestors in the last five generations, says Mine Erturk, a Ph.D. candidate at Stanford Graduate School of Business. Your DNA ties you to hundreds of distant cousins, who may share only a tiny amount of genetic material. Erturk explains, “This also affects the next five generations, including your children and grandchildren, who may not be even alive yet.”

A 2018 study published in Science found that data from MyHeritage could be used to trace the ancestry of 60% of Americans. If just 2% of adult U.S. citizens uploaded their DNA to a genetic database, 90% of the population could be identified.

Erturk says genetic research doesn’t need to expose sensitive information to find a genealogical needle in a haystack. In a preprint, Erturk and her advisor, Associate Professor Kuang Xu, detail a model for genomic search that minimizes privacy risks while maintaining its effectiveness.

Erturk says that when she first learned about genetic databases, academics were already discussing the privacy issue, but nobody was looking at the problem from a practical perspective. She and Xu are confident that their work breaks new ground in an area with hotly debated legal and ethical dimensions.

They hope that presenting a model to address the privacy issues of genetic search will spark more discussion and policy changes. “The current system doesn’t explicitly take privacy risk into account,” Xu says. “Our first objective is to increase awareness about the importance of tracking privacy risks. We also want to suggest concrete steps towards a solution.”

The Gene Genie

At the moment, public DNA databases are virtually unrestricted and almost unregulated. Erturk and Xu say that genetic data could be collected by companies looking to sell drugs or by insurance companies screening their customers for inherited diseases. People who have shared their genetic code may also be at risk from data breaches and attacks. Hackers stole data from one million GEDMatch accounts last year, and some of that information was then used in phishing emails targeting MyHeritage customers.

Erturk and Xu have proposed a new method of searching for genetic matches that protects DNA database users, their family networks, and the databases themselves. Genetic searches are currently “static,” meaning users can compare a DNA sample with every record in a database until they find a match. Erturk and Xu developed a model that limits searchers’ access to a database. Searchers would instead look for matches within small, selected sets of records, using publicly accessible genealogical records such as birth and marriage certificates to refine and target their search.

Erturk explains how she would approach searching for a match in a database like GEDMatch. “I will first examine genealogical records and identify a few people who may be related to me or could provide some clues. I will only investigate their genomes and not the entire database. Then, I will only expose a few people, not the whole database. If I fail, I return to my genealogy records and try to create another list, such as ten people. I repeat this process in sequence.”

This approach limits the searcher’s access to sensitive data while expanding the search step by step until the target is reached. Erturk and Xu say that the mathematical framework described in their paper allows the search to be controlled precisely and “vastly exceeds” static searching in optimizing the tradeoff between search time and privacy.
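The contrast between static and sequential searching can be illustrated with a toy sketch. This is not the model from Erturk and Xu’s paper. It is a minimal illustration under stated assumptions: the database is a list of 1,000 numbered records, one of which is the relative being sought, and a hypothetical rank_candidates function stands in for the genealogical records that suggest whom to examine first. All function names, the batch size, and the scoring rule are invented for this example.

```python
def static_search(database, is_match):
    """Compare the query against every record; every genome is exposed."""
    exposed = list(database)
    matches = [person for person in exposed if is_match(person)]
    return matches, len(exposed)


def sequential_search(database, is_match, rank_candidates, batch_size=10):
    """Examine small batches of likely relatives until a match is found.

    Returns the matching record (or None) and the number of genomes exposed.
    """
    remaining = set(database)
    exposed = 0
    while remaining:
        # Go back to the (hypothetical) genealogical records and build
        # a new short list from the people not yet examined.
        batch = rank_candidates(remaining)[:batch_size]
        for person in batch:
            remaining.discard(person)
            exposed += 1
            if is_match(person):
                return person, exposed
    return None, exposed


# Toy data: records 0..999, and the relative we are looking for is record 437.
database = list(range(1000))
TARGET = 437


def is_match(person):
    # Stand-in for an actual genome comparison.
    return person == TARGET


def rank_candidates(people):
    # Assumption: genealogical records point roughly at record 430,
    # so records numbered closer to 430 are examined first.
    return sorted(people, key=lambda p: (abs(p - 430), p))


found, exposed = sequential_search(database, is_match, rank_candidates)
_, exposed_static = static_search(database, is_match)
print(f"sequential: found {found}, exposed {exposed} genomes")
print(f"static: exposed {exposed_static} genomes")
```

In this toy setup, the sequential search finds the target in its second batch after exposing 15 genomes, while the static search exposes all 1,000. How large the gap is in practice would depend on how informative the genealogical records are, which is the kind of tradeoff the paper’s framework is meant to quantify.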

Old Rules, New Rules

Erturk faced a unique challenge when she set out to test the model’s efficacy on real-world data: She wanted to use an accurate genealogical database without violating anyone’s privacy. Her solution was to use the interconnected family trees of over 2,500 members of European royal households. She says that there are no privacy concerns since the family tree is public and known by everyone.

Erturk’s research builds on the literature on search problems, which often involve scenarios where searchers look for hidden targets, such as terrorists or submarines. To Erturk and Xu’s knowledge, their genetic search model is the first to incorporate a privacy dimension into this problem.

Although their analysis is based on advanced mathematics, the basic concept will sound familiar to anyone who has watched a police procedural such as The Wire. Xu likens it to a phone tap: If cops want to listen in on a suspect’s calls, it isn’t practical (or legal) to listen to every phone line. Instead, they must target their search and obtain a court warrant before gathering evidence.

Xu believes criminal investigators searching for DNA matches within public databases should be subject to similar restrictions preventing them from sifting through vast amounts of data. Only Montana and Maryland currently have laws that regulate the use of genetic genealogy in law enforcement.

Erturk and Xu’s paper does not include explicit policy recommendations. Still, they view their model as a step in answering many legal and logistical questions regarding how our data is stored. The growth of DNA-based private data collection could require us to change our notion of privacy.

“Because of our close ties, society must consider genetic privacy a collective duty,” Xu explains. “You must protect your mother, father, son, daughter, and cousins.”