Making the jump from GRCh37 to GRCh38

11/12/2021

Jumping from GRCh37 to GRCh38: one giant leap for clinical genetics, one small step with Congenica

The sequencing of the initial drafts of the human genome in 2001 revolutionized the field of genomics, and since then many drafts and updates have attempted to correct errors or fill in complex regions that were not addressable with first-generation sequencing tools[1].

The recent Telomere-to-Telomere (T2T) Consortium report, which presented “the first truly complete 3.055 billion base pair sequence of a human genome,”[2] has again brought into focus the need for clinical labs to use the latest consensus sequences for their analyses.

While the switch to this latest consensus sequence is likely still a long way off because it requires both long- and short-read sequencing technologies, the fact that most clinical labs are still using the GRCh37 reference sequence, released in 2009, is clearly a concern, especially since many resources are now switching to the GRCh38 reference sequence, which was published in 2013.

Why switch?

There are a number of compelling reasons:

Academic research has moved firmly to use GRCh38, and some resources, such as Decipher, have already stopped working on GRCh37.
The UK’s 100,000 Genomes Project and the NHS Genomic Medicine Service (GMS) are working on GRCh38, and the consensus sequence is also recommended by the Broad Institute (makers of GATK).[3]
GRCh38 corrected many errors and alignment issues that were present in GRCh37, making it a higher-quality, less noisy assembly, which in turn makes it possible to detect variants that may be missed with the GRCh37 assembly.[4] As an example, when Steward et al. reviewed old epilepsy cases, three additional clinically relevant variants were identified in the revised SCN1A gene model.[5]

The general shift in usage to GRCh38 and the greater chance of detecting variants of interest are contributing to an increased desire for clinical genetics teams to make the jump to GRCh38. However, the workload for switching remains challenging.

Making the switch and avoiding pitfalls: the devil is in the details

Laboratories that perform genetic testing typically maintain a historic repository of variants that they’ve analyzed and reported on, which informs their analysis of new cases. These data are often stored in isolated databases, spreadsheets and/or text files and must be converted from different assemblies and nomenclatures before they can be applied in the new environment.

Under the best of circumstances, the task of conversion can be daunting. Add to that the often manually compiled nature of the data, and the opportunity for error is significant, especially when those errors are compounded by inadvertent “correction” by the software tools themselves. A report by Ziemann et al. in the journal Genome Biology showed that roughly one-fifth of papers with supplementary gene lists in Excel files contain incorrect gene names![6]

It turns out that Excel has a habit of converting gene names to date format, which can cause headaches when moving knowledgebases into new platforms. Some common unwanted conversions include:

SEPT2 [Septin 2] being converted to 2-Sep
MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] being converted to 1-Mar

These errors often make it even more difficult to move knowledge bases between consensus sequences.
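A quick screen for this kind of damage can be scripted before any migration starts. The following is a minimal sketch in Python, assuming the gene column has already been exported from the spreadsheet as plain text. The function names and the month-to-symbol table are illustrative assumptions and are far from a complete list of affected symbols.

```python
import re

# Month abbreviations Excel uses when it renders a date as "2-Sep" or "1-Mar".
# ASSUMPTION: this prefix table is illustrative only, not a complete list.
MONTH_TO_PREFIXES = {
    "Sep": ["SEPT"],
    "Mar": ["MARCH", "MARC"],
    "Dec": ["DEC"],
}

# Matches cells such as "2-Sep" or "11-Mar".
DATE_LIKE = re.compile(r"^(\d{1,2})-([A-Z][a-z]{2})$")


def suggest_original_symbols(cell: str) -> list[str]:
    """Return candidate gene symbols for a date-like cell, or [] if none apply."""
    match = DATE_LIKE.match(cell.strip())
    if not match:
        return []
    number, month = match.groups()
    prefixes = MONTH_TO_PREFIXES.get(month, [])
    return [f"{prefix}{int(number)}" for prefix in prefixes]


def find_mangled(cells):
    """Yield (index, cell, candidates) for each cell that looks date-converted."""
    for index, cell in enumerate(cells):
        candidates = suggest_original_symbols(cell)
        if candidates:
            yield index, cell, candidates


if __name__ == "__main__":
    column = ["SCN1A", "2-Sep", "1-Mar"]
    for index, cell, candidates in find_mangled(column):
        print(f"row {index}: {cell!r} may originally have been {candidates}")
```

The conversion is lossy: a cell reading 1-Mar could have started life as more than one symbol, so the candidates are prompts for manual review rather than automatic fixes.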


Congenica is here to help

If you are already using the Congenica platform with GRCh37, the transition to GRCh38 can be made overnight. All gene annotations provided by the platform will be automatically migrated over, so you don’t need to worry about losing data. The platform supports both consensus sequences and provides an identical experience, making the transition painless.

If you are new to using Congenica, the clinical decision support platform chosen by many of the world’s leading clinical laboratories and the NHS Genomic Medicine Service, then the process is still straightforward. You’ll instantly gain access to all provided annotations mapped to the GRCh38 reference sequence.

In both cases, collating and combining your laboratory’s external knowledgebases of previously interpreted variants for re-use within the platform is often the biggest challenge. For example, the data repository of one of our customers using GRCh37 consisted of 130 spreadsheets and thousands of Word documents.

Once collated, their data was run through a series of customized scripts to check the integrity and identify formatting errors and misassignment of gene names so the data could be corrected before being converted to GRCh38 coordinates, uploaded, and made available for use.
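To give a flavor of what such integrity checks might look like, here is a minimal Python sketch. It is not the scripts used for that customer. The record layout (chrom, pos, ref, alt, assembly), the accepted assembly labels and the function names are all assumptions made for illustration.

```python
import re

# ASSUMPTION: records are dicts with these keys; real exports will differ.
VALID_CHROMS = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}
ALLELE = re.compile(r"^[ACGT]+$")
POSITION = re.compile(r"^[1-9][0-9]*$")
EXPECTED_ASSEMBLIES = {"GRCh37", "hg19"}


def normalize_chrom(raw: str) -> str:
    """Strip an optional 'chr' prefix and map 'M' to 'MT'."""
    chrom = raw.strip()
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    chrom = chrom.upper()
    return "MT" if chrom == "M" else chrom


def check_record(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = []

    if normalize_chrom(str(record.get("chrom", ""))) not in VALID_CHROMS:
        problems.append(f"unrecognized chromosome: {record.get('chrom')!r}")

    # Positions must be plain positive integers (no thousands separators).
    if not POSITION.match(str(record.get("pos", "")).strip()):
        problems.append(f"position is not a positive integer: {record.get('pos')!r}")

    for field in ("ref", "alt"):
        allele = str(record.get(field, "")).strip().upper()
        if not ALLELE.match(allele):
            problems.append(f"{field} allele is not a plain ACGT string: {record.get(field)!r}")

    if str(record.get("assembly", "")).strip() not in EXPECTED_ASSEMBLIES:
        problems.append(f"unexpected source assembly: {record.get('assembly')!r}")

    return problems


if __name__ == "__main__":
    rows = [
        {"chrom": "chr2", "pos": "166000000", "ref": "G", "alt": "A", "assembly": "GRCh37"},
        {"chrom": "chr23", "pos": "1,234", "ref": "G", "alt": "-", "assembly": "GRCh37"},
    ]
    for number, row in enumerate(rows, start=1):
        for problem in check_record(row):
            print(f"row {number}: {problem}")
```

Collecting every problem for a row, rather than stopping at the first, makes it easier to hand a complete correction list back to the curators in one pass.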

Ensuring that legacy knowledgebase data is valid and available for use is critical for established laboratories, no matter the nomenclature that was used when the data was first collected. Even old data stored in gene-based rather than genome-based coordinate systems can be migrated and mapped as part of a customized transition process.

Once the knowledgebase is uploaded, variant identifiers and annotations are migrated to the consensus sequence of choice and the variant name is updated in line with the new nomenclature used by the consensus sequence.
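For readers curious about how a coordinate is carried from one assembly to another in general, the idea can be shown with a toy example. The Python sketch below is not Congenica's implementation, and the aligned blocks in it are invented rather than taken from any real GRCh37-to-GRCh38 alignment. In practice this job is normally done with established tools such as UCSC liftOver or Picard LiftoverVcf together with published chain files. The sketch also ignores strand changes and only handles single positions on one chromosome.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Block(NamedTuple):
    src_start: int  # first position of the block in the old assembly (1-based)
    dst_start: int  # corresponding first position in the new assembly
    size: int       # number of bases aligned without gaps


# HYPOTHETICAL blocks for a single chromosome, sorted by src_start.
TOY_BLOCKS = [
    Block(1, 1, 1000),
    Block(1201, 1001, 500),  # old positions 1001-1200 have no counterpart
    Block(1701, 1601, 300),  # new positions 1501-1600 are newly inserted sequence
]


def lift_position(pos: int, blocks: list[Block]) -> Optional[int]:
    """Map a position to the new assembly, or return None if it falls in a gap."""
    starts = [block.src_start for block in blocks]
    index = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if index < 0:
        return None
    block = blocks[index]
    offset = pos - block.src_start
    if offset >= block.size:
        return None  # past the end of the block: unmapped in the new assembly
    return block.dst_start + offset


if __name__ == "__main__":
    for position in (500, 1100, 1201, 1750):
        print(position, "->", lift_position(position, TOY_BLOCKS))
```

Positions that return None are the interesting ones in a real migration: they cannot be converted automatically and need manual review.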

While the priority is usually bringing lists or databases of previously interpreted variants on board, there may be unsolved clinical cases that could benefit from realigning and reanalyzing the original FASTQ data against the latest consensus sequence. As an example, there are some potentially pathogenic epilepsy variants that cannot be mapped in GRCh37 but can be mapped in GRCh38 because of the better-quality sequence and enhanced annotation.

Re-running unsolved cases from scratch is another area that Congenica can support to ensure that variants that might have been masked by incomplete or noisy consensus sequence data can now be uncovered.

Conclusion

The transition to GRCh38 (and eventually beyond) provides new analytical opportunities and benefits, including the detection of additional variants that would have been missed with GRCh37.

Congenica can help support that transition, ensuring that your laboratory’s historic knowledgebases and curated variant lists are migrated to the new consensus sequence.

To learn more about how Congenica can support and facilitate your transition from GRCh37 to GRCh38, please contact us. Our experts will be happy to talk with you.

References

[1] Sergey Nurk et al. The complete sequence of a human genome. bioRxiv 2021.05.26.445798. https://doi.org/10.1101/2021.05.26.445798

[2] Andrew P. Han. Telomere-To-Telomere Team Assembles Complete Human Genome En Route to Reference Pangenome. Genomeweb.com, June 03, 2021. https://www.genomeweb.com/sequencing/telomere-telomere-team-assembles-complete-human-genome-en-route-reference-pangenome

[3] GATK Technical Documentation / Glossary: Human genome reference builds - GRCh38 or hg38 - b37 - hg19. https://gatk.broadinstitute.org/hc/en-us/articles/360035890951-Human-genome-reference-builds-GRCh38-or-hg38-b37-hg19

[4] Li H, Dawood M, Khayat MM, Farek JR, Jhangiani SN, Khan ZM, Mitani T, Coban-Akdemir Z, Lupski JR, Venner E, Posey JE, Sabo A, Gibbs RA. Exome variant discrepancies due to reference-genome differences. Am J Hum Genet. 2021 Jul 1;108(7):1239-1250. Epub 2021 Jun 14. PMID: 34129815; PMCID: PMC8322936. http://doi.org/10.1016/j.ajhg.2021.05.011

[5] Charles Steward et al. Re-annotation of 191 developmental and epileptic encephalopathy-associated genes unmasks de novo variants in SCN1A. NPJ Genom Med. 2019 Dec 2;4:31. http://doi.org/10.1038/s41525-019-0106-7

[6] Ziemann, M., Eren, Y. & El-Osta, A. Gene name errors are widespread in the scientific literature. Genome Biol 17, 177 (2016). https://doi.org/10.1186/s13059-016-1044-7

Learn more about the Genome Reference Consortium (GRC).