Why the human genome was never completed
No human genome has ever been read in its entirety before. This year, scientists expect to pass that milestone for the first time.
Before the end of 2023, you should be able to read something remarkable. It will be the story of a single individual, who they are and where they come from – and it will offer hints about what their future holds. It probably won't be the most entertaining read on first glance, and it will be very, very long. But it will be a seminal moment – the publication online of the entire genome of a human being, end to end with no gaps.
At this point you may feel that you have heard this before. Surely the human genome was published decades ago? Isn't that all done?
It was, in fact, never finished. The first draft of the human genome was released in 2001, before a consortium of international scientists of the Human Genome Project announced that they had "completed" the job with a finished sequence in 2003. Assembled from chunks of various people's DNA, this became the "reference" sequence against which all other human DNA could be compared.
It was certainly the best that could be done at the time, but had major gaps and errors. Later releases improved on it, but many of the problems persisted. Only in the last few years has technology advanced to the point that it is possible to read the entire human genome, without gaps and with minimal errors. But these have all been composites, using DNA drawn from multiple individuals. This year, for the first time, the entire genome of a single human being – a man named Leon Peshkin – is due to be released.
This complete, single human genome will be a monumental technical achievement. Only 70 years have passed since the double-helix structure of DNA was first revealed, thanks in part to a grainy black and white image taken by Rosalind Franklin, transforming our understanding of how genetic information is stored. Today we have the capability to read the entire genetic 'textbook' that makes a person unique.
But the geneticists involved say it is also a beginning, not an end. They now want to sequence the genomes of people from around the world, to build up a true picture of our species' genetic diversity. They want to understand what the previously unsequenced sections of DNA are doing. And they want to roll out end-to-end genome sequencing in clinics, to help doctors diagnose and treat us when we get sick.
In short, the human genome will never be complete. We will never be done reading it.
The first human genome
The original Human Genome Project (HGP) was one of the biggest scientific projects ever attempted, costing about $3bn (£2.5bn) . Launched in 1990, its aim was to read all the DNA the average human carries in their cells. The first draft sequence was published just over a decade later, in 2001. Simultaneously, another version of the genome was published by Celera Genomics.
What is a genome?
The genome is often compared to a kind of book, written in the alphabet of DNA rather than English. The DNA alphabet only has four letters: A, C, G and T. They each represent different molecules called bases that are strung along the length of the DNA molecule. When read correctly, a particular sequence of these letters forms a gene.
The job of translating this information falls to the molecular machinery inside our cells. Some genes provide the information needed to make an enormous variety of different proteins that perform assorted functions important to life, while other sections of DNA have regulatory functions. What the Human Genome Project team figured out was the exact order of the bases along the DNA, something like CGATTTCCGAAAA – and so on for over three billion letters.
The human genome promised many things. We would discover what the genes were doing, especially those involved in diseases. This would enable a future of personalised medicine, in which we all received treatments shaped around our genetic makeups. The genome also promised insights into our evolutionary origins: exactly how are we different from our closest living relatives, chimpanzees and bonobos?
Some of this has happened, some of it hasn't. We certainly know a lot more about the functions of many genes and their roles in diseases ranging from breast cancer to schizophrenia. However, in practice it has turned out that most diseases are affected by hundreds of genes, so genomic medicine hasn't got very far. Few inherited diseases are caused by a single faulty gene, and the use of genetic screening to detect those at risk of rare conditions is largely only used for those considered at the highest risk.
In contrast, genetics has reshaped our understanding of human evolution, for instance revealing that our ancestors interbred with other hominins like Neanderthals.
Meanwhile, lurking in the background has been the awkward fact that the human genome was never actually finished. While geneticists have been cleaning it up ever since the first draft was published, most recently in February 2022, some parts of the genome remained out of reach.
Repeats, repeats and more repeats
The problem is that some DNA is extremely repetitive. Certain stretches of the genome repeat the same sequences over and over, sometimes for thousands of bases.
The repetitive DNA often turns up in the same bits of the genome. Our DNA isn't stored in one long continuous rope, but is instead split into smaller chunks called chromosomes. These are X-shaped (apart from one Y-shaped chromosome carried by men) and there are 23 pairs of them in humans. Each has repetitive DNA at the tips of its four arms – the telomeres – and at the central cross, the centromere. Both are important.
The telomeres are thought to act as protective caps, and damage to them has been linked to ageing. Meanwhile, the centromeres are crucial to the process of cell division that underlies growth and reproduction, and which goes wrong in cancer. Reshuffling of the DNA in the centromeres has been implicated in the development of some cancers.
However, the Human Genome Project could not sequence the repetitive DNA, and didn't try. Their method was not up to the challenge. They did not read the entire genome in one go, but instead cut it up into small chunks a few hundred bases long, read those, then stitched them back together with a computer. This would not work for repetitive sections, because the computer had no way to figure out what order the segments fitted together. It would be like trying to do a jigsaw that was all cloudless blue sky.
"Eight per cent was missing from the officially finished version in 2003," says Adam Phillippy, head of genome informatics at the National Human Genome Research Institute in Bethesda, Maryland.
And so our repetitive DNA sat, almost entirely unread, for 20 years. Then in 2021 Phillippy and his colleagues announced they had read it all.
End to end
The Telomere-to-Telomere (T2T) consortium was not some grand project dreamt up by major funding agencies and backed by billions of dollars. "This was really a grassroots effort that was driven through a pandemic," says Karen Miga, a geneticist at the University of California, Santa Cruz. For many genomics experts, "we came out of nowhere".
The key breakthrough was the ability to accurately read much longer stretches of DNA, says Evan Eichler, professor of genome sciences at the University of Washington in Seattle. Other long-read technologies had been developed, but even five years ago they were not considered to be accurate enough. So improving the accuracy was "a key development".
So was the ability to conduct "ultra-long" reads of sequences involving more than 100,000 bases. "What that was important for was actually bridging across very large complex regions of the genome," says Eichler.
T2T's first big success came in July 2020, when they released the complete sequence of a human X chromosome. At the time, the best X chromosome sequence had 29 gaps and the T2T team filled all of them. The following year they released the complete sequence of chromosome 8.
Those studies just represented what the team had got through peer review in a journal. In reality, they were much further ahead. In 2021 they also released a preprint titled "The complete sequence of a human genome", in which they filled in the missing 8%.
Reading the repeats
This was still not quite a complete genome. The team had used "a little bit of a trick, some people called it a cheat", says Eichler.
Most of the cells in our bodies contain two copies of every chromosome: one from our mother, one from our father. This makes it harder to reassemble chunks of sequence in the computer, because the two copies are ever so slightly different. To get around this difficulty, T2T used unusual cells that have two copies of the father's DNA, which are near-identical.
The cells came from a hydatidiform mole: a kind of failed pregnancy. Eggs and sperm only have one copy of each chromosome, so when a sperm fertilises an egg the resulting embryo has two copies. However, sometimes an egg loses its DNA and then gets fertilised. The cell then duplicates the DNA from the sperm. Hydatidiform moles form dangerous growths, rather like cancers, that must be removed.
This was what T2T had sequenced. One perspective is that they had only read half the genome, because the full genome has two copies of everything. However, their sequence was also a clear improvement on the best available. It added over 200 million letters to the human genome, and 2,000 extra genes.
Having a complete genome means it should finally be possible to work out what the long repetitive sections of DNA are doing, says Miga. "Now that we have these maps, I am super excited to know what lives in these regions," she says. "What's their major function? And if something goes wrong in those regions, how can that contribute to understanding human disease and human health?"
Beyond the Genome
It has been 20 years since the Human Genome Project was "completed". But it quickly became apparent that the efforts to sequence and map the human "book of life" was only just the beginning. Far from closing the question of what makes our bodies tick and why they do so differently, research on the human genome has revealed a far more complex picture than anyone could have imagined. Beyond the Genome examines the paradigm shift in our understanding of our genetics in the past two decades, including just how far-reaching the influence of our genes can be and how we in turn can influence our own DNA through health and lifestyle.
The repetitive DNA includes a lot of sequences that can move around in the genome, dubbed "mobile DNA".
"We're finding that a lot of these elements have contributed to evolutionary novelty," says Rachel O'Neill, a molecular geneticist at the University of Connecticut in Storrs. Many evolutionary leaps including the placenta, why we lost our tails and certain brain functions "can be attributed to these types of mobile DNA".
Meanwhile, Eichler highlights "segmental duplications" in which long stretches of DNA that can include multiple genes are duplicated wholesale. These sequences can evolve unusually rapidly. "It's like quicksilver," says Eichler. "They result in the emergence of new genes that are human-specific," he says. "Those genes are contributing disproportionately to the differences that make us human." While the human and chimpanzee genomes are 99% identical, segmental duplications are one way that important differences can arise, and the original human genome sequence largely missed them.
Neuroscientists have shown that some of these duplicated genes are important in brain function – but geneticists could not study them in detail because they were in repeats that older genomes did not capture.
The T2T sequence was eventually published in a special issue of the journal Science in March 2022. But by then the team was already moving forward.
The big remaining gap was the Y chromosome, which is found only in males. Sperm will usually only carry one sex chromosome – either an X or a Y. Because the hydatidiform mole DNA used by T2T came from a sperm that happened to be carrying an X chromosome, the Y was missing.
To really finish the job, the team needed a male donor. They turned to Leon Peshkin.
DNA donor
Peshkin is a systems biologist at Harvard Medical School in Boston, Massachusetts. Much of his research is focused on understanding the mechanisms of ageing and how to slow them. "I happen to believe that the duration of normal healthy human life is completely arbitrary, that it could be radically extended," he says. Genomics is a big part of his work and he wants to drive it forwards. As a result, he has donated his DNA to a string of major sequencing projects. "My genome is by far the best characterised individual genome on the planet today, [or] at least definitely the best characterised public genome – not just of humans, of any species," he explains.
Peshkin's first donation was to the Personal Genome Project, launched in 2005. Its purpose was to elicit volunteers who would share their DNA publicly, to enable more rapid and effective research – and to overcome fears about potential abuses of genomic data.
A decade later, Peshkin's DNA was re-used by the Genome in a Bottle (GIAB) project. This aimed to sequence the genomes of cell lines that could be grown indefinitely in the lab, making it easier to study the effects of mutations. Peshkin's genome was desirable because he had also signed up his parents for the Personal Genome Project, giving GIAB a mother-father-son trio.
Peshkin has no regrets about his choices, although he does mention one inconvenient consequence. "I cannot go into [any] lab working with my cells, because if immortalised cells end up in me my immune system will not recognise them," potentially sending his immune system into a dangerous overdrive. "I could die from this." He is happy for his DNA to be sequenced again by T2T, this time end-to-end.
In December 2022, T2T released another preprint paper, describing the end-to-end sequence of Peshkin's Y chromosome. Because it has so many repeats and complexities, more than half of the chromosome was missing from previous genomes. The new sequence added over 30 million letters, including dozens of genes.
The team is now working on Peshkin's entire genome, including both copies of every chromosome. "We have finished sequencing and assembling it," says Phillippy. The resulting genome is "complete and gapless", and the duplicate chromosomes are fully resolved. All that remains is checking. "There's a handful of errors we can see, so we have to go through the time-intensive task of going through each of them," says Phillippy. He says it should be out this year.
The pangenome
Is that it, then? Will the human genome be completed this year?
The answer is no, because there is no one human genome. Everyone's DNA is different and those differences matter. We won't really understand the genome until we have a record of how it varies from population to population.
The original HGP tried to deal with this by taking blood samples from multiple people, all from New York. The sequence that was published was a mash-up of all of them. In effect, they tried to present an "average" genome – but one American city doesn't remotely represent the full spectrum of human genetic diversity.
That's why many members of the T2T consortium have also signed up to another project: the Human Pangenome Reference Consortium. The aim is to sequence hundreds of people from around the planet. The genomes won't be quite end-to-end, because they are sacrificing a little bit of completeness in favour of automated methods that will allow them to include more people. In July 2022, the team released a preprint describing 47 sequenced genomes, which they combined to produce a draft pangenome.
They are now forming partnerships with researchers around the world, including the global south. "We don't want there to be just a monopoly or one place that directs all this," says Eichler. "Because I think it's better to have genomes, especially some of these populations where we don't understand the genetic diversity very well, being done in their own communities by their own people."
The pangenome endeavour is already yielding results. Phillippy is co-author on a study released in January that identifies a mechanism for a common genetic abnormality. About one new-born baby in 1,000 has a "Robertsonian translocation" in which two chromosomes fuse. If no genetic material is lost, the person's health is unaffected, but, in some cases, it can lead to conditions such as Down's syndrome.
It turns out there is a conserved stretch of DNA – one that tends to be similar across different species – that is found on multiple chromosomes. This can confuse the cell's mechanisms for copying DNA and cause the chromosomes to fuse. The crucial sequence lies in a region that is both repetitive and highly variable between individuals, so could not be studied without multiple end-to-end genomes.
Findings like this explain why many of the team want to see long-read sequencing rolled out in hospitals. "My endpoint is when we can replicate these T2T genomes for any patient in the clinic," says Phillippy. "The methodology we've been developing is aimed at that." The cost of genome sequencing has been falling dramatically for decades, so the T2T project cost a fraction of what the original HGP did.
It's clear that we still have much to learn from our genomes. As new techniques unlock more of the genomes secrets and allow more to be sequenced, there's no finish line in sight. "As long as there's humans, the human genome project will continue," says O'Neill.
--