Principles of Protein Structure: EBI


Tuesday, 7 March 2017

Mapping the Evolution of Enzyme Function

The Institute of Structural and Molecular Biology, which combines the research endeavours of Birkbeck and University College London in these disciplines, runs a programme of weekly research seminars throughout the university terms. Each term’s seminars are linked by a theme, and the theme for the spring term of 2017 has been ‘Bioinformatics and Computational Biology’. Early in February, the Institute was delighted to welcome one of the UK’s foremost structural biologists, Professor Dame Janet Thornton, to give a talk in this series. Thornton was well known to many in the large audience, having spent the whole of the 1980s at Birkbeck, rising to be a professor in the School of Crystallography (now part of Biological Sciences). During the 1990s she held chairs at both Birkbeck and UCL and founded a biotech company, Inpharmatica, before leaving to direct the European Bioinformatics Institute (EBI) at Hinxton, near Cambridge. She has now stepped down from the directorship but maintains an active research group at the EBI.

The topic that Thornton chose to present was one that she had worked on throughout her long career: the structure, function and evolution of the enzymes. When she started studying proteins there were probably about 20 known structures. The PDB now holds well over 120,000 protein structures, and tens of thousands of these are of enzymes, so there is plenty of data to work with.

And enzymes are particularly easy to work with because their functions are so well characterised. Back in the 1960s an Enzyme Commission assigned a set of four numbers (‘EC numbers’) to each enzyme. There are six primary enzyme classes, each of which is divided into sub-classes and sub-sub-classes; the final number is a serial number that defines the enzyme’s substrate. So, for example, phosphoinositide phospholipase C is also known as EC 3.1.4.11; the 3 indicates that this enzyme is a hydrolase, the 1 that it acts on ester bonds and the 4 that it is a phosphoric diester hydrolase. The other top-level classes are the oxidoreductases (1); the transferases (2); the lyases (4); the isomerases (5); and the ligases (6). EC numbers define enzyme function rigorously, so referencing them in computer programs is straightforward.
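As a rough illustration of how easily EC numbers can be handled in a program, here is a minimal Python sketch. The function names `parse_ec` and `top_level_class` are made up for this example and do not come from any particular library; the class table simply restates the six classes listed above.

```python
# The six top-level EC classes as listed in the text above.
EC_CLASSES = {
    1: "oxidoreductase",
    2: "transferase",
    3: "hydrolase",
    4: "lyase",
    5: "isomerase",
    6: "ligase",
}


def parse_ec(ec):
    """Split an EC number such as 'EC 3.1.4.11' into a tuple of four integers."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()  # drop an optional 'EC' prefix
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers, got %r" % ec)
    return tuple(int(part) for part in parts)


def top_level_class(ec):
    """Return the name of the top-level class for an EC number."""
    return EC_CLASSES[parse_ec(ec)[0]]


print(parse_ec("EC 3.1.4.11"))
print(top_level_class("3.1.4.11"))
```

This sketch only handles fully specified four-part numbers; partial entries that use dashes in place of missing levels would need extra handling.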

Thornton and her group chose to focus on those enzymes that have a well-characterised catalytic function that is mainly involved in small-molecule metabolism. All enzymes with these characteristics were grouped into homologous superfamilies (that is, families of proteins with a clear common evolutionary ancestor) and the members of each superfamily were annotated with EC numbers as a proxy for their function. For example, the superfamily of enzymes that are clearly related to phosphoinositide phospholipase C by structure and function includes not only enzymes classified as 3.1.4.11 but also sphingomyelin phosphodiesterases D (3.1.4.41) and phosphatidylinositol diacylglycerol-lyases (4.6.1.13). The two phosphodiesterases have the same chemistry (as specified by the first three EC numbers) but act on substrates with very different shapes, while the chemistry of the 4.6.1.13 enzymes differs significantly from the others.
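One simple way to express "same chemistry, different substrate" in code is to count how many leading EC levels two enzymes share. The sketch below is only an illustration of that idea, not the method used by the Thornton group, and `shared_levels` is a hypothetical helper name.

```python
def shared_levels(ec_a, ec_b):
    """Count how many leading EC levels (0 to 4) two EC numbers have in common."""
    count = 0
    for level_a, level_b in zip(ec_a.split("."), ec_b.split(".")):
        if level_a != level_b:
            break  # stop at the first level where the two numbers differ
        count += 1
    return count


# The three functions found in the phospholipase C superfamily described above.
print(shared_levels("3.1.4.11", "3.1.4.41"))  # same first three levels
print(shared_levels("3.1.4.11", "4.6.1.13"))  # different top-level class
```

Three shared levels means the two enzymes differ only in the serial number, that is, in substrate; zero shared levels means that even the basic reaction type has changed.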

In this example, comparing the structures of enzymes with the EC numbers 3.1.4.41 and 3.1.4.11 showed that the active site residues involved in their reaction mechanism, and the bound metal ion that each needs for catalysis, superimpose very well, but that the rest of the active site varies significantly to allow substrates with distinctly different sizes and shapes to bind. In contrast, the lyase 4.6.1.13 has a similar-shaped active site to 3.1.4.11 but no bound metal and different catalytic residues. In this case it is likely that a single amino acid change, removing an aspartic acid residue and therefore a negative charge, has removed the ability of the enzyme to bind a metal ion and thus changed the reaction that the enzyme catalyses.

Enough data was available to group the enzymes in this superfamily, and in another 275, into phylogenetic trees to map out the evolutionary route taken within each superfamily and catalogue all possible evolutionary changes of function. Some of these are much more complex than the one outlined above. For example, the analysis showed that five classes of flavin-dependent mono-oxygenases with different chemistry were evolutionarily related. Here, the change in chemistry seems to have arisen not from a simple substitution of one amino acid for another but from a change in the multi-domain architecture of the protein.

The group constructed an ‘EC exchange matrix’ from this data to show how many times each top-level EC class had changed into each other class during evolution. While most changes in chemistry left the top-level class – the basic type of the reaction – unchanged, every possible change had occurred at least once in evolutionary history. In fact, 11% of the changes catalogued were changes to top-level class. The diagram below illustrates this data in a series of six circles, one for each ‘original’ enzyme class, with the width of each strip indicating the number of transitions from one class to another: for example, the thick red strip going from the ‘top’ to the ‘bottom’ of the top left-hand circle illustrates that a lot of transitions from oxidoreductases (class 1) to transferases (class 2) have been observed.

An overview of functional evolution in enzymes. © Nicholas Furnham & Sergio Martinez Cuesta, EBI
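The idea of an EC exchange matrix can be sketched in a few lines of Python. The list of transitions below is toy data, assembled from the three EC numbers mentioned earlier purely for illustration (the directions of change are assumptions, not results from the paper), and the function names are hypothetical.

```python
from collections import Counter


def exchange_matrix(transitions):
    """Count transitions between top-level EC classes.

    `transitions` is an iterable of (old_ec, new_ec) strings; the result
    maps (old_class, new_class) pairs to the number of times each was seen.
    """
    counts = Counter()
    for old_ec, new_ec in transitions:
        old_class = int(old_ec.split(".")[0])
        new_class = int(new_ec.split(".")[0])
        counts[(old_class, new_class)] += 1
    return counts


def fraction_changing_class(counts):
    """Fraction of all catalogued changes that alter the top-level class."""
    total = sum(counts.values())
    changed = sum(n for (old, new), n in counts.items() if old != new)
    return changed / total if total else 0.0


# Toy data only: the directions of these changes are assumed for illustration.
toy_transitions = [
    ("3.1.4.11", "3.1.4.41"),
    ("3.1.4.11", "4.6.1.13"),
    ("3.1.4.41", "4.6.1.13"),
]
matrix = exchange_matrix(toy_transitions)
print(matrix[(3, 4)], matrix[(3, 3)])
print(fraction_changing_class(matrix))
```

Applied to the real catalogue of changes, the second function would correspond to the 11% figure quoted above; with this toy list it is of course much higher.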

They then looked in much more detail at the changes observed in the catalytic site of each superfamily during evolution, and found that active sites differ in ‘plasticity’. At one extreme there is the TIM barrel ‘superfold’, which is a scaffold that holds amino acids with different chemistry in similar positions to catalyse many different reaction types. At the other extreme, there are seven superfamilies in which the catalytic residues are 100% conserved. It is interesting to try to correlate sequence similarity with ‘functional similarity’, but this runs into the problem of how to define functional identity. With enzymes, any measure of functional similarity will include a contribution from the chemical similarity of the substrates and this is difficult to gauge, particularly as most of the best computational tools were written for commercial drug discovery and are therefore not in the public domain. Preliminary results suggest that there is some correlation, but it is much weaker than that between sequence and structural similarity.

Thornton summed up her lecture by re-stating that evolutionary changes to enzyme substrate specificity are much commoner than those to basic chemistry. Evolution has, however, given rise to an explosion in enzyme function. The EC system has catalogued a total of 2,994 unique enzyme functions, but only 379 different structures (CATH superfamilies) are known to have enzymatic activity. Most enzyme functions will therefore have evolved from another function, with each catalytic activity arising independently only a few times throughout evolutionary history. The evolutionary relationships within enzyme superfamilies are complex and there are many ways in which their function can diverge.
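A back-of-the-envelope calculation makes the point. The sketch below simply divides the two figures quoted above; it ignores the fact that the same function can occur in more than one superfamily, so the result is a rough average rather than a precise count.

```python
ec_functions = 2994             # unique enzyme functions in the EC system (from the text)
enzymatic_superfamilies = 379   # CATH superfamilies with known enzymatic activity (from the text)

# Average number of distinct functions per enzymatic superfamily.
average_functions = ec_functions / enzymatic_superfamilies
print(round(average_functions, 1))  # 7.9
```

With roughly eight functions for every superfamily on average, most functions must share an evolutionary origin with at least one other.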

Much of the work Thornton presented has been described in a 2012 paper in PLoS Computational Biology; its lead author, Nick Furnham from the Thornton group at the EBI, is now a group leader at one of Birkbeck’s neighbouring colleges, the London School of Hygiene and Tropical Medicine. PPS students will learn much more about the structure, function and mechanisms of enzymes in section 10 of the course, ‘Protein Interactions and Function’.

The most recent paper from the Thornton group on this topic is:
Furnham N, Dawson NL, Rahman SA, Thornton JM, Orengo CA. Large-Scale Analysis Exploring Evolution of Catalytic Machineries and Mechanisms in Enzyme Superfamilies. Journal of Molecular Biology 428 (2016) p.253-267

Wednesday, 9 March 2016

Evolution and Assembly of Protein Complexes

Since 2003, all Birkbeck researchers in structural biology and allied disciplines have collaborated with colleagues at UCL in the Institute of Structural and Molecular Biology. The Institute holds a varied series of events throughout the year, including a programme of research seminars arranged termly around current themes in molecular and structural biology research. The theme for the spring term 2016 seminar programme has been ‘Protein Dynamics: from Folding to Function’; one of the first of the distinguished scientists invited to present their research under that theme was Sarah Teichmann from the EMBL – European Bioinformatics Institute and the Sanger Institute at Hinxton, near Cambridge, UK. She gave a fascinating talk that linked evolution and protein folding to the topic of Section 7 of the PPS course, quaternary structure (or the assembly of protein complexes).

Teichmann has won many awards in what is still quite a short research career, including the Biochemical Society’s Colworth Medal for ‘an outstanding research biochemist under the age of 35’ (2011) and the EMBO Gold Medal (2015). Last year she was elected a Fellow of the prestigious Academy of Medical Sciences, with a citation that commended her as representing ‘a new breed of scientists at the interface between computational and experimental molecular biology’. She is also an advocate for women in science and has written a children’s novel.

She began her seminar by asking two related questions: ‘how do protein complexes assemble?’ and ‘how do protein complexes evolve?’ and by misquoting the poet John Donne: ‘no protein [man] is an island’. Many proteins are functional only when bound to others to form complexes, and in the crowded environment of a cell each newly synthesised protein has only a limited amount of time to find its partners and form a stable complex. Much can be learned about the evolution and dynamics of complex formation by studying the complexes that are available in the Protein Data Bank. Her group’s evaluation of these structures has contributed to the software that the PDB uses to predict the functional biological unit (monomer, dimer or multimer) for each structure in the PDB, and has led to the 3Dcomplex.org database of protein complexes. This database provides a hierarchical classification of now over 30,000 protein quaternary structures. Each complex is represented using graph theory as a simple 2D figure or ‘mini-graph’, with each polypeptide chain as a node and each interaction surface between two chains as an edge. These little graphs make it easier to distinguish between topologies involving the same number of subunits: for example, a complex of six identical protein chains may be a simple hexamer with 6-fold rotational symmetry (such as the traffic ATPase [PDB 1g6o]) or a dimer of trimers with 32 symmetry (such as annexin XII [PDB 1aei]). The links here are to the pages describing those proteins in Section 7 of the PPS course material.
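The mini-graph idea is easy to sketch in code. In the example below each chain is a node and each interface an edge; the edge lists are idealised versions of a six-membered ring and a dimer of trimers, not the actual contacts in 1g6o or 1aei, and the helper names are invented for this illustration.

```python
def make_graph(edges):
    """Build an undirected graph (chain -> set of neighbouring chains) from interface pairs."""
    graph = {}
    for chain_a, chain_b in edges:
        graph.setdefault(chain_a, set()).add(chain_b)
        graph.setdefault(chain_b, set()).add(chain_a)
    return graph


def degree_sequence(graph):
    """Sorted list of the number of interfaces made by each chain."""
    return sorted(len(neighbours) for neighbours in graph.values())


# Idealised ring hexamer: each chain touches its two neighbours in the ring.
ring_hexamer = make_graph(
    [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("F", "A")]
)

# Idealised dimer of trimers: two three-membered rings, plus one contact
# from each chain to its partner in the other trimer.
dimer_of_trimers = make_graph(
    [("A", "B"), ("B", "C"), ("C", "A"),
     ("D", "E"), ("E", "F"), ("F", "D"),
     ("A", "D"), ("B", "E"), ("C", "F")]
)

print(degree_sequence(ring_hexamer))
print(degree_sequence(dimer_of_trimers))
```

Both complexes have six chains, but their graphs differ: every chain in the ring makes two interfaces, while every chain in the dimer of trimers makes three. The degree sequence is a quick check rather than a full test of graph equivalence, which would need a proper isomorphism comparison.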

Alongside the hierarchy described in the 3Dcomplex database, protein complexes can be divided into two large groups: homomers, which consist of multiple copies of the same polypeptide chain, and heteromers involving different chains. (Haemoglobin, a tetramer with two alpha and two closely related beta chains, is arguably an intermediate between the main two types.) Teichmann spent the rest of her lecture addressing three related questions about the assembly of both homomers and heteromers:

i) Does the assembly of protein complexes drive evolution?
ii) What are the mutational mechanisms involved in complex formation?
iii) Can the principles of protein assembly be used to predict topologies that have not yet been seen?

Starting with the first question, from an evolutionary point of view the simplest complex to form is a homodimer with two copies of the same monomer; one mutation that turns part of a protein surface into a ‘sticky patch’ is all that is necessary to stabilise dimer formation. Not surprisingly, the homodimer is also the commonest type of quaternary structure found in the PDB. Once a protein has dimerised, additional monomers can be added to form larger complexes with cyclic symmetry, or the dimer itself can (for example) dimerise. The order in which the interfaces in a multimer formed during evolution can be predicted from the amount of surface area buried by the formation of each interface, with the largest surface areas being buried first. This simple rule applies to complicated assemblies as much as to simple ones, and to heteromers as much as to homomers. Therefore, for all but the very simplest structures, it is almost impossible to predict the form that a complex will take unless you know the order in which the subunits assemble. Joseph Marsh, a former postdoc in Teichmann’s group now working at the MRC Human Genetics Unit in Edinburgh, represents this here in an analogy with the assembly of flat-pack furniture, with and without instructions.

Illustration © Joseph Marsh, MRC Human Genetics Edinburgh
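The "largest interface first" rule can be sketched as a simple greedy procedure: sort the interfaces by buried surface area and join sub-complexes in that order. The code below is a simplified illustration rather than the method used by Teichmann's group, and the chain names and areas (in Å²) are invented values for a hypothetical tetramer.

```python
def assembly_steps(interfaces):
    """Predict an assembly order from interface sizes.

    `interfaces` is a list of (chain_a, chain_b, buried_area) tuples.
    Interfaces are visited from largest to smallest buried area; each one
    that joins two separate sub-complexes is recorded as an assembly step.
    """
    group = {}  # chain -> frozenset of chains in its current sub-complex
    steps = []
    ordered = sorted(interfaces, key=lambda item: item[2], reverse=True)
    for chain_a, chain_b, _area in ordered:
        group_a = group.get(chain_a, frozenset([chain_a]))
        group_b = group.get(chain_b, frozenset([chain_b]))
        if group_a == group_b:
            continue  # both chains are already in the same sub-complex
        merged = group_a | group_b
        for chain in merged:
            group[chain] = merged
        steps.append((sorted(group_a), sorted(group_b)))
    return steps


# Hypothetical tetramer: two large interfaces and two small ones.
toy_interfaces = [
    ("A", "C", 600.0),
    ("A", "B", 1500.0),
    ("B", "D", 600.0),
    ("C", "D", 1500.0),
]
steps = assembly_steps(toy_interfaces)
for left, right in steps:
    print(left, "+", right)
```

For this toy input the two dimers (A with B, C with D) form first through the large interfaces and then join through a smaller one, giving a dimer of dimers. In a real symmetric complex equivalent interfaces form together, which this step-by-step sketch does not attempt to model.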

Teichmann tested some of her predictions of protein assembly pathways using mass spectrometry in collaboration with Professor Dame Carol Robinson’s group at the University of Oxford, and found that seven out of nine pathways and 22 out of 27 steps within those pathways had been predicted correctly. This hierarchy of subunit assembly can also be used to predict the evolution of a complex, so it is clear that the assembly of protein complexes can indeed drive evolution.

Turning to the second question, Teichmann used specific examples of protein families that take up different quaternary structures in different species, including the PyrR family of bacterial pyrimidine operon attenuators, to explore the evolutionary mechanisms that take a protein from one that is most stable as a monomer to different multimeric forms. These can involve direct mutations at the interface between subunits (for example, making the protein surface ‘stickier’ or creating a salt bridge) and other so-called ‘allosteric’ mutations that change the protein structure to allow different interfaces to form. Often, the difference between (for example) a protein that is stable as a dimer and one that is stable as a tetramer will come down to changes in a few amino acids. In the case of the PyrR attenuator family, mutations away from the interface drive a conformational change that is equivalent to the one that occurs when the protein binds DNA, and so stabilise multimer formation.

Finally, Teichmann considered the use of the assembly principles that she had outlined in predicting the form that a protein complex would take from scratch. Most basic steps in complex assembly, as described earlier, can be grouped into one of three categories: dimerization of one or more chains, adding an identical subunit or subunits to a complex (cyclization) and adding a different type of subunit. These can be combined in different ways to form a large number of possible quaternary structure topologies. So far, about 120 different topologies are represented in the PDB, with four or five new ones being added each year, and the vast majority of these fit into one of Teichmann’s topologies. She assembled all the predicted topologies, including those not yet observed, into a ‘periodic table of protein complexes’ (S.E. Ahnert et al., Science 350, aaa2245 (2015)). This table has already been seen to correctly predict the topology of some newly determined complexes that were not included in the original list.

Wednesday, 7 May 2014

The Many Uses of Bioinformatics

Every year, Birkbeck hosts a lecture by a distinguished scientist to honour the memory of the founder of its Crystallography Department, J.D. Bernal. “Sage”, as he was called by all who worked with him, had an enormous range of research interests spanning both science and society; he is widely considered one of the most brilliant scientists never to have won a Nobel Prize. The 2014 Bernal Lecture, held on March 27, was given by Professor Janet Thornton, the director of the European Bioinformatics Institute (EBI) at Hinxton near Cambridge.

Introducing the lecture, Professor David Latchman, Master of Birkbeck, described it as a unique occasion: the only time he has introduced as a guest lecturer someone whom he had interviewed for a job. Thornton includes both Birkbeck and UCL on her CV: appropriately, her last post in London was that of Bernal Professor, held jointly at both colleges. She moved on to “even greater heights” as director of one of Europe’s top bioinformatics institutions in 2003.

Thornton began her lecture with a quote from Bernal: “We [academics] can go on being useless up to a point, with confidence that sooner or later some use will be found for our studies”. That quote is of particular relevance to the subject that she has made her own: bioinformatics. She had already begun her research career in 1977, when Fred Sanger invented the process that was used to obtain the DNA sequence of the human genome. That endeavour, which was completed in 2003, took over ten years and cost billions of dollars. Sequencing a human-sized genome, which has about 3 billion base pairs of DNA, now takes maybe 10 minutes and costs about a thousand dollars. While a decade ago we had one “Human Genome”, we now have lots. Mega-sequencing projects already planned or in progress include projects to sequence about 8,000 Finns, and the entire 50,000 population of the Faeroe Islands; one to sequence paired tumour and normal genomes from 20,000 cancer patients; and the UK10K project, which is investigating the genetic causes of rare diseases.

It is now almost extraordinarily simple and cheap to obtain genomic data, but real challenges remain in interpreting and understanding it so that it can be used in medicine. This is the province of bioinformatics, and Thornton devoted much of her presentation to explaining five ways in which gene (and protein) sequence information is being applied to both basic and clinical medical research:

Understanding the molecular basis of disease
Investigating differences in disease risk caused by human genetic variation
Understanding the genomics of cancer
Developing drugs for infectious diseases, including neglected diseases
Investigating susceptibility to infectious disease

There are rather more than 20,000 genes in the human genome, far fewer than were originally predicted. Tiny differences between individuals in many of these either directly cause a genetic disorder or confer an increased – or in some cases decreased – risk of developing a disease. The genetic causes of some diseases, such as the bleeding disorder haemophilia, were known many years before the “genome era”: others have been discovered more recently. Mapping known mutations onto the structure of the enzyme copper, zinc superoxide dismutase has revealed the cause of the inherited disorder amyotrophic lateral sclerosis, a form of motor neurone disease. And knowing the genome sequence has already made an enormous contribution to our understanding of the mechanisms of disease development, contributing to improvements in diagnosis and the design of novel drugs.

We now understand that cancer is a genetic disease: it arises when mutations in a group of cells cause them to grow and divide excessively. A cancer is no longer classified just by its location (for example, a breast or lung cancer) but by the particular spectrum of genetic variations in its cells. About 500 different genes are known to be mutated in cancer, some much more often than others. For example, about 60% of cases of melanoma, a type of skin cancer, contain one specific mutation in the gene BRAF. This codes for a protein that can direct cells to grow and divide, and the cancer-causing mutation sticks this protein into the ON position, so this signal is always sent. Scientists in a company called Plexxikon used their knowledge of this mutation and the structure of the protein to design a drug, vemurafenib, which prevents the BRAF protein from signalling. This can cause a dramatic, if short-term, improvement in melanoma patients, but, crucially, it only works in patients whose cancers carry this mutation. It is one of the first developed examples of a “personalised medicine” that is only used alongside a diagnostic test for a genetic variation. There will soon be many more.

Genomics is also proving very useful in the fight against infectious disease. Antibiotic resistance is one of the greatest emerging threats to human health, and scientists have to use all the tools at their disposal, including genomics and bioinformatics, as they try to stay one step ahead of rapidly mutating pathogens. Sequencing is widely used to track the sources of outbreaks of infection and of resistant bacteria such as methicillin-resistant Staphylococcus aureus (MRSA) in hospitals, and it is the only way of determining the exact nature of an infection. One of the most dramatic examples of the use of genomics in infectious disease control occurred in 2011, when a novel strain of E. coli O104 caused about 4,000 cases of serious food-borne illness and 50 deaths in Germany. This was originally linked to cucumbers imported from Spain but a global effort to trace its specific sequence variants proved that the source of the infection was beansprouts grown on a farm near Hamburg.

There was much more to Thornton’s wide-ranging lecture than simply bioinformatics and medicine: more, indeed, than it is possible to do justice to in a single blog post. She went on to describe some of the benefits of genomics for agriculture and food security. These included designing new strategies for controlling pests and diseases, maximising the efficiency of biomass processing, and even managing biodiversity. It is necessary to measure biodiversity in order to manage it properly; it is now possible to define a short stretch of DNA sequence that fully identifies a species or sub-species (a so-called “DNA barcode”) and these are beginning to be used to track some very diverse organisms, including the 400,000 known species of beetle.

The lecture ended with a short discussion of some of the challenges facing bioinformatics and genomics in the second decade of this century, largely relating to difficulties with storing, manipulating and understanding the enormous quantity of data that is being generated. Mining this data mountain for the benefit of mankind is a task that is beyond either the academic community or the biotech industry alone. It will require novel ways of doing science that involve governments and charities as well as academia and industry. The new Centre for Therapeutic Target Validation, launched at Hinxton on the same day as Thornton’s Bernal Lecture, is a pioneering example of such a partnership. It has been set up by the EBI, the Sanger Institute where a third of the original human genome sequence was obtained, and pharmaceutical giant GSK, and its scientists aim to use the whole range of available genomic data to select and evaluate new targets for novel drugs.

Bioinformatics is covered in section 6 of the PPS course. Students who take the second-year option Techniques in Structural Molecular Biology will return to it then, where the material focuses on selecting protein targets for structural genomics initiatives: a task that is linked to that of selecting drug discovery targets.

This post will be cross-posted on the Birkbeck Events blog.
