A new study led by the Spanish National Cancer Research Centre (CNIO) reveals that up to 20% of genes classified as coding (those that produce proteins, the building blocks of all living things) may not be coding after all, because they have characteristics typical of non-coding genes or pseudogenes (obsolete coding genes). The consequent reduction in the number of human protein-coding genes could have important effects in biomedicine, since knowing how many genes produce proteins, and which ones they are, is of vital importance for research into many diseases, including cancer and cardiovascular disease.
The work, published in the journal Nucleic Acids Research, is the result of an international collaboration led by Michael Tress of the CNIO Bioinformatics Unit along with researchers from the Wellcome Trust Sanger Institute in the United Kingdom, the Massachusetts Institute of Technology in the United States, the Pompeu Fabra University and the National Center for Supercomputing (BSC-CNS) in Barcelona, and the National Center for Cardiovascular Research (CNIC) in Madrid.
Since the sequencing of the human genome was completed in 2003, experts from around the world have been working to compile the final human proteome (the full set of proteins produced from genes) and the genes that produce them. The task is immense, given the complexity of the human genome and the fact that we have about 20,000 separate coding genes.
The researchers analyzed the genes cataloged as protein-coding in the main reference human proteomes. A detailed comparison of the reference proteomes from GENCODE/Ensembl, RefSeq and UniProtKB found 22,210 coding genes, but only 19,446 of these genes were present in all three annotations.
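The comparison described here is, in essence, a set operation across three catalogs. The minimal sketch below illustrates the idea only: the gene identifiers are toy placeholders and the variable names are hypothetical. It is not the study's actual pipeline or data.

```python
# Toy placeholder identifiers, NOT real gene IDs or real catalog contents.
gencode = {"gene_a", "gene_b", "gene_c", "gene_d"}
refseq = {"gene_a", "gene_b", "gene_c"}
uniprot = {"gene_a", "gene_b", "gene_e"}

# Every gene annotated as coding in at least one catalog.
all_genes = gencode | refseq | uniprot

# Genes annotated as coding in all three catalogs.
in_all_three = gencode & refseq & uniprot

# Genes present in only one or two of the catalogs.
in_one_or_two = all_genes - in_all_three

print(sorted(in_all_three))   # ['gene_a', 'gene_b']
print(sorted(in_one_or_two))  # ['gene_c', 'gene_d', 'gene_e']
```

In the study, the analogue of `all_genes` held 22,210 genes and the analogue of `in_all_three` held 19,446.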
When they analyzed the 2,764 genes that were present in only one or two of these reference annotations, they were surprised to discover that experimental evidence and manual annotations suggested that almost all of these genes were more likely to be non-coding genes or pseudogenes. In fact, these genes, together with another 1,470 coding genes that are present in all three reference catalogs, were not evolving like typical protein-coding genes. The conclusion of the study is that most of these 4,234 genes probably do not code for proteins.
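The figures quoted in the article fit together as simple arithmetic, and the short check below makes that explicit. It uses only the numbers reported above; the variable names are illustrative.

```python
# Figures quoted in the article.
total_coding = 22_210          # coding genes across the three catalogs
in_all_three = 19_446          # genes present in all three annotations
flagged_in_all_three = 1_470   # in all three, but not evolving like typical coding genes

# Genes present in only one or two annotations.
in_one_or_two = total_coding - in_all_three

# Genes the study considers unlikely to code for proteins.
likely_non_coding = in_one_or_two + flagged_in_all_three

# Share of the annotated coding genes that are in question.
share = likely_non_coding / total_coding

print(in_one_or_two)      # 2764
print(likely_non_coding)  # 4234
print(f"{share:.1%}")     # 19.1%
```

The resulting share of about 19% matches the "up to 20%" figure cited at the start of the article.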
The study is already paying off, according to the scientists. "We have been able to analyze many of these genes in detail," Tress explains, "and more than 300 genes have already been reclassified as non-coding." The results are already being included in the new annotations of the human genome by the GENCODE international consortium, of which the CNIO researchers are part.