The Open Human Genome, twenty years on

June 26, 2020, by Tim Hubbard

On 26th June 2000, the “working draft” of the human genome sequence was announced to great fanfare. Its availability has gone on to revolutionise biomedical research. But this iconic event, twenty years ago today, is also a reference point for the value and power of openness and its evolution.

Biology’s first mega project

Back in 1953, it was discovered that DNA was the genetic material of life. Every cell of every organism contains a copy of its genome, a long sequence of DNA letters, containing a complete set of instructions for that organism. The first genome of a free-living organism – a bacteria – was only determined in 1995 and contained just over half a million letters. At the time sequencing machines determined 500 letter fragments, 100 at a time, with each run taking hours. Since the human genome contains about three billion letters, sequencing it was an altogether different proposition, going on to cost of the order of three billion dollars.

A collective international endeavour, and a fight for openness

It was sequenced through a huge collective effort by thousands of scientists across the world in many stages, over many years. The announcement on 26th June 2000 was only of a draft – but still sufficiently complete to be analysed as a whole. Academic articles describing it wouldn’t be published for another year, but the raw data was completely open, freely available to all.

It might not have been so, as some commercial forces, seeing the value of the genome, tried to shut down government funding in the US and privatise access. However openness won out, thanks largely to the independence and financial muscle of Wellcome (which paid for a third of the sequencing at the Wellcome Sanger Institute) and the commitment of the US National Institutes of Health. Data for each fragment of DNA was released onto the internet just 24hrs after it had been sequenced, with the whole genome accessible through websites such as Ensembl.

Openness for data, openness for publications

Scientists publish. Other scientists try to build on their work. However, as science has become increasingly data rich, access to the data has become as important as publication. In biology, long before genomes, there were efforts by scientists, funders and publishers to link publication with data deposition in public databases hosted by organisations such as EBI and NCBI. However, publication can take years and if a funder has made a large grant for data generation, should the research community have to wait until then?

The Human Genome Sequence, with its 24-hour data release model was at the vanguard of “pre-publication” data release in biology. Initially the human genome was seen as a special case – scientists worried about raw unchecked data being released to all or that others might beat them to publication if such data release became general – but gradually the idea took hold. Dataset generators have found that transparency has generally been beneficial to them and that community review of raw data has allowed errors to be spotted and corrected earlier. Pre-publication data release is now well established where funders are paying for data generation that has value as a community resource, including most genome related projects. And once you have open access data, you can’t help thinking about open access publication too. The movement to change the academic publishing business model to open access dates back to the 1990s, but long before open access became mandated by funders and governments it became the norm for genome related papers.

Big data comes to biology, forcing it to grow up fast

Few expected the human genome to be sequenced so quickly. Even fewer expected the price to sequence one to have dropped to less than $1000 today, or to only take 24 hours on a single machine. “Next Generation” sequencing technology has led to million-fold reductions in price and similar gains in output per machine in less than 20 years. This is the most rapid improvement in any technology, far exceeding the improvements in computing in the same period. The genomes of tens of thousands of different organisms have been sequenced as a result. Furthermore, the change in output and price has made sequencing a workhorse technology throughout biological and biomedical research – every cell of an organism has an identical copy of its genome, but each cell (37 trillion in each human) is potentially doing something different, which can also be captured by sequencing. Public databases have therefore been filling up with sequence data, doubling in size as much as every six months, as scientists probe how organisms function. Sequence is not the only biological data type being collected on a large scale, but it has been the driver to making biology a big data science.

Genomics and medicine, openness and privacy

Every individual’s genome is slightly different and some of those difference may cause disease. Clinical geneticists have been testing Individual genes of patients to find for cause of rare diseases for more than twenty years, but sequencing the whole genome to simplify the hunt is now affordable and practical. Right now our understanding of the genome is only sufficient to inform clinical care for a small number of conditions, but it’s already enough for the UK NHS to roll out whole genome sequencing as part of the new Genome Medicine Service, after testing this in the 100,000 genomes project. It is the first national healthcare system in the world to do this.

How much could your healthcare be personalised and improved through analysis of your genome? Right now, an urgent focus is on whether genome differences affects the severity of COVID-19 infections. Ultimately, understanding how the human genome works and how DNA differences affect health will depend on research on the genomes of large numbers of individuals alongside their medical records. Unlike the original reference human genome, this is not open data but highly sensitive, private, personal data.

The challenge has become to build systems that can allow research but are trusted by individuals sufficiently for them to consent to their data being used. What was developed for the 100,000 genomes project, in consultation with participants, was a research environment that functions as a reading library – researchers can run complex analysis on de-identified data within a secure environment but cannot take individual data out. They are restricted to just the statistical summaries of their research results. This Trusted Research Environment model is now being looked at for other sources of sensitive health data.

The open data movement has come a long way in twenty years, showing the benefits to society of organisational transparency that results from data sharing and the opportunities that come from data reuse. The Reference Human Genome Sequence as a public good has been part of that journey. However, not all data can be open, even if the ability to analyse it has great value to society. If we want to benefit from the analysis of private data, we have to find a middle ground which preserves some of strengths of openness, such as sharing analytical tools and summary results, while adapting to constrained analysis environments designed to protect privacy sufficiently to satisfy the individuals whose data it is.

• Professor Tim Hubbard is a board member of the Open Knowledge Foundation and was one of the organisers of the sequencing of the human genome.

Tim Hubbard

Website | + posts

Tim Hubbard is Professor of Bioinformatics and Head of Department of Medical and Molecular Genetics at King’s College London. He is also Head of Genome Analysis at Genomics England, a company established by the UK government to execute the 100,000 Genome Project, which aims to mainstream the use of whole genome sequence analysis for treatment in the UK National Health Service (NHS). From 1997-2013 he worked at the Wellcome Trust Sanger Institute where he was one of the organisers of the sequencing of the human genome. In 1999 he co-founded the Ensembl project to analysis, organise and provide access to the human genome and from 2007 led the GENCODE project to annotate the structure of all human genes. He is an advocate of the benefits of open access and open data release for science and society as a whole and has served on multiple national information access advisory boards including Europe PMC (PubMedCentral) the repository for open access publications. He received his BA from Cambridge University (UK), and PhD from Birkbeck College, University of London (UK).