How to store data for 1,000 years
Most current data storage systems eventually stop working. But there are alternatives on the horizon.
"You know you're a nerd when you store DNA in your fridge."
At her home in Paris, Dina Zielinski, a senior scientist in human genomics at the French National Institute of Health and Medical Research, holds up a tiny vial to her laptop camera for me to see on our video call. It's hard to make out, but she tells me that I should be able to see a mostly clear, light film on the bottom of the vial – this is the DNA.
But this DNA is special. It does not store the code from a human genome, nor does it come from any animal or virus. Instead, it stores a digital representation of a museum. "That will last easily tens of years, maybe hundreds," says Zielinski.
Research into how we could store digital data inside strands of DNA has exploded over the past decade, in the wake of efforts to sequence the human genome, synthesise DNA and develop gene therapies. Scientists have already encoded films, books and computer operating systems into DNA. Netflix has even used it to store an episode of its 2020 thriller series Biohackers.
The information stored in DNA defines what it is to be human (or any other species for that matter). But many experts argue it offers an incredibly compact, durable and long-lasting form of storage that could replace the many forms of unreliable digital media available, which regularly become defunct and require huge amounts of energy to store. Meanwhile, some researchers are exploring other ways we could store data effectively forever, such as etching information onto incredibly durable glass beads, a modern take on cave drawings.
But how long could this data really last, and can we really rely on it to store the reams of data now being produced by humanity for posterity?
***
As we move towards an ever more digitised world, our reliance on data is skyrocketing. Films, photographs, webpages, business documents, critical security records – everything we use is digitised, and we are using more and more of it.
Most of the data we have produced is stored as 1s and 0s on magnetic media such as hard drives, but this is far from an ideal solution. For one thing, demagnetisation is a huge issue – permanent magnets gradually lose their magnetic field over time, so to keep data reliably it's important to rewrite hard drives every few years. "It lasts on average maybe 10 to 20 years, maybe 50 if you're lucky and the conditions are perfect," says Zielinski.
Storing data also requires huge data centres which use large amounts of energy to keep things cool – not ideal in a world prone to energy crises. The problem is seen as significant – the US government's molecular information storage (Mist) programme, launched in 2019, aims to find an alternative to today's huge data storage facilities, for example.
"We're actually running out of hardware. I think that industry can't really keep up with generating enough hard disks and servers to store all this data on," says Zielinski.
But do we really need to keep all this data, and preserve it for so long?
People want to store data for the long term for a huge variety of reasons. One is science – researchers are generating unprecedented amounts of data, and the more they have, the better. Radio telescopes and particle accelerators like the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (known as Cern) on the border of France and Switzerland, for example, generate reams of data, and scientists want to keep all of it, says Latchesar Ionkov, a computer scientist working on DNA storage at Los Alamos National Laboratory. The LHC alone produces 90 petabytes (90 million gigabytes) per year.
Mark Bathe, a professor of biological engineering at Massachusetts Institute of Technology, co-founded the start-up Cache DNA to make biomolecules widely accessible and useful. The global threats facing humanity compel us to preserve both human-made information, such as art and science, and the DNA of all living things on the planet, says Bathe. "That way, if life were to either be recreated here or otherwise transferred or imported from other planets and so forth, there would be records of what we did, and what we had," he says.
Many DNA storage researchers believe they have hit on the perfect storage medium for both widespread and incredibly long-term storage. We typically view DNA as a way to store genomic information, but many researchers are now excited about the possibility of storing the vast quantities of digital data currently choking up data centres across the world.
DNA is a natural choice here, says Bathe. "Nature has used DNA for many millennia to store information in the form of genomes," he says. "It's been around [for billions of years], it's something that you can kind of bank on. As long as that's the fundamental information storage medium of a species, like humans, then it's going to be something that we know what to do with."
Compare the fact that DNA has been optimised over the last 3.7 billion years or so to the information age, which really began in the 1950s, says Zielinski. "We've come pretty far in man-made technology, but it doesn't get much better than DNA in terms of efficiency – when we start as one cell, all the instructions are there to direct every single cell until you reach the nearly 30 trillion cells that make up a human."
What's more, the fact we can recover DNA fragments from million-year-old animals such as woolly mammoths that deliver meaningful data about their genomes shows DNA is incredibly durable, says Zielinski. The half-life of DNA – the time it takes to degrade by half – is around 500 years in a well-preserved fossil, which means the DNA would cease to be at all readable after around 1.5 million years.
However, DNA is incredibly fragile, and the conditions that lead to fossilisation are extraordinarily rare. "There are tonnes of ways to destroy it," says Olgica Milenkovic, a professor of electrical and computer engineering at the University of Illinois at Urbana–Champaign. Humidity, acids, and radiation all damage DNA. "But if it's kept cold and dry, it's good for hundreds of years."
Even better, DNA can be protected by encapsulating it inside other materials such as glass beads – mimicking how genetic material is protected within ancient fossils. Robert Grass, a researcher at ETH Zurich, Switzerland, and his team have shown these beads protect the DNA from both chemicals and heat.
Further protection could come from locating it in a physically safe place. Storing data critical for humanity in encapsulated DNA in an ice vault could mean "it can last forever, pretty much", says Milenkovic.
Another huge perk of DNA is that it is an incredibly dense store of information, to an extent unmatched by any man-made device. The estimated 33 zettabytes of data that humans will have produced by 2025 (that's 33 followed by 21 zeroes, counted in bytes) could be squeezed into the size of a ping-pong ball with DNA storage, according to Ionkov. He believes storing this much information in DNA could be mere decades away.
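To get a feel for that density claim, here is a rough back-of-the-envelope sketch. It rests on assumptions that are not from Ionkov: an ideal code of two bits per nucleotide with no redundancy or indexing, and an approximate average mass of 330 grams per mole for one nucleotide of single-stranded DNA.

```python
AVOGADRO = 6.022e23        # molecules per mole
BYTES_TOTAL = 33e21        # 33 zettabytes, the article's figure
BITS_PER_NUCLEOTIDE = 2    # assumption: ideal four-letter code, no redundancy
GRAMS_PER_MOLE = 330.0     # assumption: rough average for one nucleotide

# Number of nucleotides needed to hold every bit once.
nucleotides = BYTES_TOTAL * 8 / BITS_PER_NUCLEOTIDE

# Convert a count of nucleotides into a mass in grams.
mass_grams = nucleotides / AVOGADRO * GRAMS_PER_MOLE

print(f"{nucleotides:.2e} nucleotides")
print(f"about {mass_grams:.0f} g of DNA")
```

On these assumptions the answer comes out at tens of grams, which is at least the right order of magnitude for something you could hold in one hand. Real schemes add redundancy and indexing, so a practical figure would be higher.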
DNA storage is also unlikely to ever become obsolete, unlike other man-made storage media – "which one of us still uses floppy disk?" asks Milenkovic. With DNA, we should always be able to read it. "With every man-made technology, you need a new device to read it," says Zielinski. "If DNA is obsolete, then we have other problems to worry about."
The immortality project
This article is part of The Immortality Project. To celebrate BBC Future's 10th birthday this year – and the wider BBC's 100th – we are exploring what it takes to have a legacy that lasts not just decades but millennia. From long-lived sandwiches to venerable knowledge, art and even religions, we'd like to know how some things survive for thousands or even millions of years, and use this insight to look at whether it is possible to leave a mark on the world that extends into the deep future. There will be articles, videos and experiments. One of us may or may not be fossilising their toenail clippings...
There are other perks to DNA storage, too. It has already piggybacked on research in medical science, such as gene therapy and synthetic biology, notes Milenkovic, and this will continue as that research advances. It would also use next to no energy to store.
Of course, there are huge challenges. As one 2018 paper put it, while DNA has "an enormous potential as a data storage device of the future, multiple bottlenecks such as exorbitant costs, excruciatingly slow writing and reading mechanisms, and vulnerability to mutations or errors need to be resolved".
***
The process of converting digital data into DNA basically consists of turning it into a DNA alphabet. DNA is made up of four molecules known as nucleotides or bases: adenine (A), cytosine (C), guanine (G), and thymine (T), joined together in different sequences in a long string. The most common way to turn digital information into DNA code simply requires converting the 0s and 1s of digital code into these four letters, then synthesising the DNA to match.
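As a minimal sketch of that conversion, the snippet below packs two bits into each base using the example assignment Milenkovic gives (A for 00, T for 01, G for 10, C for 11). The function names are illustrative only, and real encoding schemes add error correction and avoid awkward sequences, which this toy version does not attempt.

```python
# Example assignment quoted in the article: A=00, T=01, G=10, C=11.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "G", "11": "C"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of bases, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Turn a string of bases back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # TAGATGGT
print(decode(strand))  # b'Hi'
```

Each byte becomes four bases, so the two-character message "Hi" turns into an eight-letter strand.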
"You can use A to represent, for example, 00; T to represent 01; G to represent 10 and C, 11," says Milenkovic. "Then you can take any digital content that exists classically on a disk or a tape or a flash, and convert it into a four-letter alphabet."
DNA synthesis was the method used by the two breakthrough papers published in 2012 and 2013 which each stored around 700kB of data in DNA (the previous record was less than 1kB). In a 2017 paper, Zielinski (then a researcher at the New York Genome Center) and her colleagues stored a scientific paper, one-minute film, computer operating system, computer virus and Amazon gift card – totalling around 2MB – on DNA using this method.
The huge barrier to storing lots of data on DNA, of course, is the cost, which is far higher than storing data on servers or hard disks. It cost Zielinski $7,500 (£6,729) to store those five digital items.
The cost of DNA storage is "a bit of a moving target", adds Zielinski, as it depends on the synthesis method as well as the encoding scheme and how it is decoded. A reasonable estimate is around a few thousand dollars per megabyte (MB) to both encode and decode by sequencing, she says.
To convert this article and its pictures into DNA, for example, would mean initially compressing the data from roughly 20MB to around 500kB, applying an encoding scheme, then sending this off to a lab to synthesise it at a rough cost of $1,000 (£897). The lab would complete the laborious process of making it for me using a technique which adds one nucleotide at a time to each string of DNA. "The biggest bottleneck is actually synthesising that DNA," says Zielinski. "That's the biggest focus, reducing synthesis costs."
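The figures quoted so far can be put on a common footing with a little arithmetic. This is only a sketch: it treats 500kB as 0.5MB and takes the dollar amounts at face value, and the helper name is made up for illustration.

```python
def cost_per_mb(total_cost_usd: float, size_mb: float) -> float:
    """Cost in US dollars for each megabyte stored."""
    return total_cost_usd / size_mb


experiment_2017 = cost_per_mb(7_500, 2.0)   # five files, around 2MB
this_article = cost_per_mb(1_000, 0.5)      # around 500kB after compression

print(f"2017 experiment: ${experiment_2017:,.0f} per MB")   # $3,750 per MB
print(f"This article:    ${this_article:,.0f} per MB")      # $2,000 per MB
```

Both results sit within the "few thousand dollars per megabyte" range that Zielinski describes.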
However, the resulting strands don't need to be perfect. If you're using it for data storage rather than medical procedures – which is what DNA synthesis was originally developed for – there could be a higher tolerance for errors. So the door is open for faster, less precise methods of synthesis. "You can handle errors in the data and still recover your files. And so we can handle a much messier synthesis," says Zielinski.
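One simple way to see why messy strands can still give back clean files is to keep several copies and take a vote at each position. The sketch below uses that majority-vote idea purely as a stand-in: it is not the method Zielinski's team used, published schemes rely on stronger error-correcting codes, and this toy version only copes with swapped letters, not missing or extra ones.

```python
from collections import Counter


def consensus(reads: list[str]) -> str:
    """Pick the most common base at each position across equal-length reads."""
    if len({len(read) for read in reads}) != 1:
        raise ValueError("reads must all have the same length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


original = "TAGATGGT"
noisy_reads = [
    "TAGATGGT",  # clean copy
    "TACATGGT",  # wrong letter at position 2
    "TAGATGCT",  # wrong letter at position 6
    "TAGTTGGT",  # wrong letter at position 3
    "TAGATGGT",  # clean copy
]

print(consensus(noisy_reads) == original)  # True
```

Three of the five copies contain a mistake, yet no single position is wrong in most of them, so the vote recovers the original strand.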
To be competitive with common digital media, says Bathe, the cost of DNA storage would have to come down by a factor of around a million. This is a long way off, but scientists are already working to increase how many DNA molecules can be written at the same time. "If you look at the electronics industry, they have seen that reduction in cost," adds Bathe. And the cost of DNA synthesis has already fallen significantly, he says.
Another option that avoids synthesis altogether is the possibility of storing data in naturally occurring DNA that has simply been edited. In 2020, Milenkovic's group edited DNA from the bacterium E. coli to store US President Abraham Lincoln's Gettysburg Address and an image of the Lincoln Memorial, using a punch-card-like system to make holes (actually little nicks out of the nucleotides using gene editing systems such as Crispr and other nicking enzymes) in the bacterium's genetic sequence. This could end up being far cheaper than making totally new DNA molecules.
"It's a completely different paradigm – you don't store information in the sequence content in the composition of ATGCs, you store information in the presence of structural changes in the double helix," says Milenkovic. The original bacterium becomes the reference point for the code, and no synthesis is needed, which means the process should be cheaper and avoid the toxic byproducts associated with synthesising DNA, she says.
However, the price paid here is in the density of data that can be stored on a given strand of DNA. "We estimated roughly a 50-fold loss in density [compared to the DNA synthesis technique]."
Another experimental method for storing data in DNA, reported by Harvard scientists in 2017, involves feeding fragments of nucleotides to an already existing DNA strand in a living cell, which incorporates the DNA fragments as an immune defence mechanism. The team inserted Eadweard Muybridge's 1878 film clip of a galloping horse into a bacterium. "The trace is left in a living organism," says Milenkovic. As long as that organism exists, including its offspring, the information is stored – although it may become mutated over time, altering the information.
***
Because we can extract data from fossils, says Ionkov, we're pretty sure that DNA storage can last a long time. "So an interesting question is actually not how long the media, the DNA molecules will last, but are we going to be able to read the data in 1,000 years."
Ionkov's organisation is part of a group called the DNA Data Storage Alliance, which is looking at how we can ensure we'll be able to decode the data in future centuries. One of its working groups, the Rosetta Stone Group, is looking at how to create a universal guide for how to read their DNA storage archive.
There are several challenges with reading DNA today. First, you need to prepare the DNA for sequencing. This involves using the common molecular technique PCR to make trillions of copies of the stretch of DNA you'd like to decode. Unfortunately, this can introduce mistakes. "Many of these errors can easily be handled in the decoding, when you decode that DNA back to your data," says Zielinski.
Next comes the sequencing itself, and there's a snag here too. Currently, sequencing is done in table-top machines which typically take several hours to run. So this form of data storage is not exactly a quick-access system.
One thing that would improve these waiting times is "random access" – the ability to dip in and out of the data to retrieve what you are looking for, so you don't have to sequence the whole lot. This has been demonstrated with DNA storage systems by adding a "barcode" to the end of the DNA strands.
However, the current DNA molecules being produced are fairly short – 150 or 200 base pairs – so using part of this space to simply identify the DNA strand via a barcode leaves even less space for writing the data you want to store, says Ionkov. "It's a pretty serious problem. But once the technology gets much better and we can write very long molecules with thousands or tens of thousands of nucleotides [base pairs], that problem will start disappearing."
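The trade-off Ionkov describes can be sketched in a few lines. The barcode lengths and the toy pool below are assumptions made for illustration, and the software filter simply stands in for whatever physical selection step a real system would use.

```python
def payload_fraction(strand_nt: int, barcode_nt: int) -> float:
    """Share of a strand left for data once the barcode is accounted for."""
    return (strand_nt - barcode_nt) / strand_nt


def select(pool: list[str], barcode: str) -> list[str]:
    """Return the data part of every strand that ends with the given barcode."""
    return [strand[:-len(barcode)] for strand in pool if strand.endswith(barcode)]


# Assumption: a 20-nucleotide barcode, chosen only for illustration.
print(f"{payload_fraction(150, 20):.1%} of a 150-base strand holds data")
print(f"{payload_fraction(10_000, 20):.1%} of a 10,000-base strand holds data")

# Toy pool with short, made-up barcodes marking two different files.
FILE_A = "ACGT"
FILE_B = "TTGC"
pool = ["GGATCC" + FILE_A, "CATCAT" + FILE_B, "TTAGGA" + FILE_A]

print(select(pool, FILE_A))  # ['GGATCC', 'TTAGGA']
```

With short strands the barcode eats a noticeable share of the space, while on much longer strands the same barcode becomes a rounding error, which is the point Ionkov makes about the problem fading as writing technology improves.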
In another method to improve random access, Bathe's team encapsulated DNA strands in silica beads labelled using short strands of nucleotides on the surface of the bead. "The same way you barcode products at a supermarket to be able to identify them uniquely, we barcode these little capsules of DNA, using nucleic acids," says Bathe.
It's not yet clear how we might integrate information stored in DNA into working computers. Bathe's team has experimented with creating a file system for the DNA. "That kind of converts the liquid or solid state of DNA information into something that is more akin to a computer hard-drive where you have the ability also to search through it with something like a search engine like Google," says Bathe. Even Microsoft is exploring how it could incorporate biomolecules into computer design.
***
However, widespread DNA synthesis would come with risks. People could try to use it to store things other than data. In theory, people could synthesise viruses or bacteria, says Zielinski – or even create someone's DNA and leave it at a crime scene. "There are actually checks in many of these pipelines that generate data that they will cross check it against known genomes to make sure there's nothing real in there, nothing harmful, like a sequence for pathogens," she says.
Bathe agrees that there are "enormous" privacy issues and risks. He notes that many companies are seeking to catalogue the DNA of everyone on the planet. Others have pointed out how frightening it is to imagine someone being able to hold the DNA sequences of billions of humans in a small data storage system. "We need to build technologies around it, because if we don't, we won't be able to mitigate those risks or understand them; it'll be a very unknown and uncontrolled entity," says Bathe.
Considering this, it's worth thinking about the alternatives to DNA data storage. Peter Kazansky, a professor in optoelectronics at the University of Southampton, has created an optical storage technology that he believes is a worthy contender – it can last for millions or even billions of years, he says.
The team works with femtosecond (one millionth of one billionth of a second) laser writing – etching information onto durable silica glass disks using a laser similar to the type used in eye surgery. The intense, short laser pulses are focused in a particular way to create a micro-explosion which makes a tiny hole in the glass. "We discovered that in these conditions very tiny nano-structures could be formed," says Kazansky. "And we use these structures to encode information."
The process is similar to how CDs and DVDs are burned using laser light on polymers or dye – but here the structures are tiny and incredibly stable, surviving temperatures up to at least 1,000C (1832F) and undamaged by radiation, says Kazansky. "One advantage of [our] storage, the main one, is durability; it can last almost forever," he says.
The technology produces information in five dimensions – on top of the usual three dimensions created by a hole, the orientation and shape of the hole can also be controlled, allowing denser data storage. This density could never approach that of DNA, but by increasing the number of layers in the etching it is slowly rising.
So far, documents including the Universal Declaration of Human Rights, Magna Carta, the King James Bible – and The Hitchhiker's Guide to the Galaxy – have all been stored using the technology. In 2018, Elon Musk sent an etching of Isaac Asimov's science fiction series Foundation into space aboard the Falcon Heavy rocket, while Microsoft has stored the entire 1978 Superman film in glass. The artist Mika Tajima has even stored "human emotion" data using this method – she collected and stored all the tweets posted in Japan in 2020.
"We use a process similar to what ancient people used – they made marks on stone with tools," says Kazansky. "It's a mechanical or physical change of material. So this kind of physical change or making holes in material is a very ancient way of securing information."
Similar to DNA storage, one of the main caveats to storing data in this way is writing speed. Kazansky says his team can now write at 500kB per second, up from at most 0.1kB per second in the initial experiment a decade ago. "To make it practical, you need a write speed of a million bytes (1,000kB) per second, at least," he says. Another barrier is reading the data, which currently needs to be done manually using an optical microscope. "To make it practical, you need to make a machine which will just take the sample, focus, move and read."
The device used to do the etching also currently fills a room, and uses a £100,000 ($112,000) laser, although Kazansky believes the size and cost could be brought down. And while very durable to temperature and radiation, encapsulating the glass in something strong may still be a good idea for anyone wanting to ensure its longevity – the glass itself could simply be broken with a stone.
"I think the etching is much less sensitive to any environmental conditions," says Zielinski. "So it's not as dense [as DNA], but it's still a very, very efficient way to store critical data, and you can certainly worry much less about it. Every storage device has its opportunities and advantages and disadvantages. And I think DNA could be complementary."
Other researchers are pursuing molecular options for encoding data that don't involve DNA, such as those using chains of other kinds of synthetic molecules which are easier and cheaper to synthesise. For example, a code can be created simply by controlling the mass of individual molecules, with different masses representing different combinations of 0s and 1s.
We already have the ability to encode digital data into DNA, encapsulate it and protect it for hundreds or potentially thousands of years. The real caveat here is choosing which data to do this with – or how to overcome the bottleneck of DNA synthesis to allow far larger amounts of data to be stored than we have so far. "I'm pretty excited about the DNA being used for storing data, [but] I think we need 20 more years," says Ionkov, although he notes that some companies believe that they will have a viable product in five years.
Zielinski believes humans will start using DNA in the next five to 10 years to store cold data that don't need to be accessed often, such as critical financial records or historical data. I ask her if one day we could be printing our own DNA on devices at home. "Absolutely, I think that will happen at some point."