RNA 3D Structure Course
by Craig L. Zirbel and Neocles Leontis at Bowling Green State University
Link to this document: https://tinyurl.com/RNA3DStructureCourse
This material was first developed by Craig L. Zirbel while visiting the University of Vienna and teaching a course in the Institute for Theoretical Biology in 2013-2014. Additional material was added by Professor Leontis in many places, especially when this material was used for workshops at the Rustbelt RNA Meeting. There are many exercises which we suggest that you take the time to do.
Table of contents (should also be available in the left side bar)
Online Recorded Lectures and Slides by Professor Leontis on RNA from 2013
Software for visualizing 3D Structures in PDB format
The RNA Bases: Purines and Pyrimidines
Exercise 1: Get to know the RNA bases
RNA base hydrogen bonding potential
Exercise 2: Canonical Watson-Crick AU and GC base pairs
Leontis-Westhof system for annotating RNA basepairs
Basepair orientations cis and trans
Three-character annotation of RNA basepairs
Exercise 3: Canonical Watson-Crick basepairs
Exercise 4: Triangle representation of RNA basepairs
Exercise 5: Annotating RNA basepairs
Exercise 6: Annotate successive RNA basepairs
Exercise 7: Form potential RNA basepairs from hydrogen bonding potential
Exercise 8: Annotate RNA basepairs
Symbolic representation of base pairing
Leontis-Westhof basepair symbols
Exercise 9: Using Leontis-Westhof symbols
Exercise 9: Identify 3’ and 5’ faces of bases
Exercise 10: Annotate base stacking
Exercise 11: Annotate base stacking
Symbols for base stacking from François Major’s group
Suggestions for making 2-dimensional RNA interaction diagrams
Exercise 13: Annotating an RNA motif
Base-phosphate interaction locations
Base-phosphate interaction and the assignment of atoms to nucleotides
Base-phosphate interaction character annotation
Base-phosphate interaction symbolic annotation
Exercise 14: Annotate base-phosphate interactions
Exercise 15: Annotate base-phosphate interactions in a helix
Exercise 16: Annotate base-phosphate interactions in a motif
Exercise 17: Extended investigation of basepairs
Anti versus syn conformation of the glycosidic bond
Exercise 18: Anti and syn in RNA helices
Exercise 19: Annotate anti and syn conformation
Exercise 20: Annotate anti and syn in an RNA motif
Exercise 21: Annotate anti and syn in an RNA motif
Experimentally determined RNA 3D structures
Where can one get RNA 3D structures?
Tour of RNA 3D structures using Representative Sets
Sequence variability in structured RNA molecules
Nucleotide to nucleotide alignments of RNA 3D structures
Exercise 22 - needs to be rewritten
Choosing basepair exemplars from each basepairing family
Displaying basepair exemplars from each geometric basepairing family
Position of glycosidic bonds in RNA basepairs in the same geometric family
IsoDiscrepancy Index to measure the degree of isostericity between two basepairs
IsoDiscrepancy values within each geometric basepairing family
Interactive RNA secondary structure
Exercise 25: Analysis of a tRNA structure
Sequence variability in the Sarcin-Ricin motif
RNA 3D Motif Atlas entry for instance IL_1S72_103 of the sarcin-ricin loop
RNA 3D Motif Atlas motif group IL_49493
Multiple sequence alignments of RNA sequences
Sequence variants of instance IL_3V2F_007 of the sarcin-ricin internal loop
Sequence variants of the sarcin-ricin motif across many instances
Inferring the geometry of RNA motifs with JAR3D
Searching RNA 3D structures with FR3D
Other interactions on circular diagrams
Symbolic search for two basepaired nucleotides
Exercise: Search for all GA basepairs in which G uses its Sugar edge
Exercise: Motif diagram for the core of the Sarcin-Ricin loop
FR3D symbolic constraint summary
Nucleotides which border or delimit a single-stranded region
Exercise: Symbolic search for hairpin loops and a little bit of context
Exercise: Symbolic search for internal loops
FR3D geometric and mixed searches
GNRA hairpin loop geometric search
FR3D guaranteed results in geometric searches
What determines the 3D structure of a 3-way junction?
Internal loop sequences with different geometry
Getting and using the Swiss PDB Viewer Program: Short Presentation
Goals of the Rustbelt Workshop
Like DNA, RNA is a linear polymer made from repeating units called Nucleotides. Like DNA, RNA has four nucleotides (A, G, C, and U instead of T in DNA). RNA is usually single-stranded while DNA is usually double stranded, with two complementary strands forming long Watson-Crick base-paired double helices.
What is a nucleotide? Each RNA or DNA nucleotide is made of three parts: A base (A, C, G, and T or U), a sugar (ribose or deoxyribose), and a phosphate group, connected covalently.
RNA nucleotides vs. DNA nucleotides: Both RNA and DNA nucleotides consist of three parts: a planar, aromatic base (consisting of C, N, O and H atoms), a sugar ring, and a negatively charged phosphate group (PO4). There are four types of bases in DNA (A, C, T, and G) and four in RNA (A, C, U, and G). There are two differences between DNA and RNA nucleotides:
Why does DNA have T? One of the most common forms of DNA damage is hydrolytic deamination of C to form U, which can pair with G to form a “wobble” pair. To repair such damage before it can lead to mutations, cells have enzymes called uracil-DNA glycosylases that detect U in DNA and cut it out. Other enzymes then replace the damaged nucleotide and restore C opposite G. Because DNA uses T rather than U opposite A, any U found in DNA can be recognized as a deaminated C, so the cell can detect where C is supposed to go during the repair process: uracil-DNA glycosylase removes U but not T.
How are nucleotides linked? DNA and RNA nucleotides are linked to each other by covalent phospho-ester bonds. Each phosphate links the two sugars of neighboring nucleotides together as shown →. The phosphates form phospho-ester bonds with the hydroxyl groups at the 3’- and 5’-positions of neighboring sugar units. The Base is attached to the C1’ position. The ring Oxygen is the 4’-O-.
DNA is double stranded and RNA is single stranded:
DNA is usually double-helical, due to the way it is replicated, comprising two complementary molecules running anti-parallel to each other. In DNA, each base is opposite its complementary base on the other strand (A opposite T and G opposite C), forming canonical Watson-Crick AT and GC base-pairs, that stack on each other like plates to make B-form, anti-parallel double helices.
RNA is usually single-stranded. But, by folding back on themselves, RNA molecules generally form short complementary double helices composed of AU and GC Watson-Crick basepairs, with occasional GU base-pairs (“wobble” pairs).
DNA helices are usually B-form while RNA helices are always A-form. Both are Right-handed.
What is secondary structure? The Watson-Crick-paired helices of RNA molecules correspond to the “secondary structure,” which is usually used to depict the structure of RNA molecules in a planar, easy-to-read format. As an example, this is the secondary structure of a tRNA, which consists of four helical regions: the “acceptor stem” (where the amino acid is attached, at the 3’-OH), the D-stem, the anti-codon stem, and the T-stem. tRNAs have 3 hairpin loops and one four-way junction (4WJ). The anti-codon loop has three nucleotides (in red) complementary to the corresponding mRNA codon for the amino acid. tRNAs generally have several modified bases, shown in blue.
There are two types of bases in RNA and DNA: purines (A and G), consisting of fused 5- and 6-membered rings, and pyrimidines (C and U or T), consisting of a single 6-membered ring. The purines, A and G, have the same ring atoms and only differ as to the attached (“exo-cyclic”) groups, which can be -NH2 or =O. The same applies to the pyrimidines, U and C. See Figure 1 below.
In the first several visualizations below, hydrogen atoms are shown, but in later visualizations, hydrogen atoms are not always shown.
Open the 3D views of RNA bases A, C, G, and U at this link.
Hydrogen Bonds vs. Covalent Bonds
Covalent bonds are much stronger (~400 kJ/mole) than the Hydrogen-bonds (H-bonds) that hold RNA base-pairs together (15-25 kJ/mole), so covalent bonds are more permanent, while H-bonds and other non-covalent interactions form and break readily during biological transformations.
Around the outside of an RNA base one finds chemical groups that have polar bonds, allowing them to form H-bonds. The atoms forming polar bonds, N, O, and H, have partial electrical charges. The partial charges are made apparent in the images below by coloring. Full-resolution versions of these images were published in this review chapter by Sweeney, Roy, and Leontis (2015). The following text was adapted from that article:
How to recognize H-Bond Acceptors:
For each base, the hydrogen bond acceptor groups are colored red, to reflect their partial negative charge. H-bond acceptor groups in RNA are “Localized Lone Pairs of Electrons on Oxygen or Nitrogen Atoms.”
Note: Lone pairs that are Delocalized and involved in Pi-bonding of the aromatic ring systems are NOT Hydrogen-bond Acceptors. For example, the lone pairs of exocyclic -NH2 groups on A, C, and G bases are delocalized, so the -NH2 groups only act as H-bond Donors.
How to recognize H-Bond Donors:
H-bond donor groups consist of hydrogen atoms covalently bonded to electronegative oxygen or nitrogen atoms. They are colored blue, reflecting their partial positive charges. The 2’-OH (hydroxyl) groups are colored purple to indicate that they can serve either as Hydrogen-bond donors or acceptors (or usually BOTH at the same time) and can therefore interact with two groups simultaneously. Each hydroxyl Oxygen atom has two localized electron lone pairs that act as acceptors. The H-atoms bonded to Nitrogen are in dark blue, indicating they have more positive charge than the H’s attached to carbon atoms (shown in light blue). The darker blue indicates H-bond donor groups that make stronger H-bonds.
Figure 1. RNA base hydrogen bonding groups. H-bond donors in blue and H-bond acceptors in red.
RNA bases are often observed to lie in the same plane and to make hydrogen bond interactions with one another. These are called RNA base pairs. They are found in a rather small number of different geometries, dictated by the locations of positively and negatively charged regions around the sides of the bases.
Roughly ⅔ of the base pairs in an RNA molecule are what are called “canonical” Watson-Crick AU and GC base pairs. These are the analogue of the Watson-Crick basepairs in DNA. Please familiarize yourself with these basepairs at this link. The first two instances are from a very high resolution structure (0.9 Angstrom resolution) which shows the hydrogens, the next two are from a high resolution structure (2.4 Angstroms) which does not show the hydrogens, and the fifth instance is a long RNA double helix. It’s good to be able to recognize the bases and their edges both with and without the hydrogens.
When RNA chains fold back on themselves, one often sees short stretches of complementary bases which make Watson-Crick GC and AU basepairs. If that is all that RNA did, it would be very similar to DNA. But the strands of RNA come together in many different ways, and the bases come together in planar, edge-to-edge hydrogen bonding interactions in many conformations other than Watson-Crick basepairs. These are key to understanding RNA 3D structures and their sequence variability between different organisms.
We now introduce a standard way to refer to the different types of basepairs that RNA bases can make. It is a simple system devised by Neocles Leontis and Eric Westhof and originally appeared in the 2001 paper “Geometric nomenclature and classification of RNA base pairs” which is available at this link. Soon after, a paper by Leontis, Stombaugh, and Westhof showed the best known examples of each RNA basepair, see “The non‐Watson–Crick base pairs and their associated isostericity matrices” from 2002 at this link.
Figure 2. Leontis-Westhof system for base edges and basepair orientation. From Comprehensive survey and geometric classification of base triples in RNA structures, by Amal S. Abu Almakarem, Anton I. Petrov, Jesse Stombaugh, Craig L. Zirbel, Neocles B. Leontis, 2011, available at this link.
In their 2001 paper, Leontis and Westhof divided the outside atoms of each RNA base into three edges, called Watson-Crick, Hoogsteen, and Sugar Edge. These are shown in the left panel of Figure 2. Note that the sugar edge includes the 2’-OH group attached to the ribose sugar ring. Note that in Figures 1 and 2, the Watson-Crick edge of each base is on the right side of the base. When two RNA bases meet in the same plane, one can usually describe the basepair that is being made by telling which edge of each base is making the contact. This system is descriptive enough to capture nearly all recurring RNA basepairs, without being overly detailed. As you look at examples, keep in mind that the Leontis-Westhof system chooses a certain balance between simplicity and detail. In some cases, more detail might be warranted.
In the top of the right panel of Figure 2, one sees a C (on the left) using its Watson-Crick edge in a basepair with a G (on the right). This is the most common RNA basepair, which occurs in RNA helices. In the bottom of the right panel, one sees a U (on the left) using its Watson-Crick edge in a basepair with the Watson-Crick edge of an A. But this is not the common UA basepair from RNA helices. The bases are not in the right orientation for that; in order to make that basepair, one would need to flip one of the bases 180 degrees out of the plane before bringing the Watson-Crick edges together again. The possible orientations of the bases can be distinguished by noting whether the glycosidic bonds, which connect each base to the ribose sugar of the backbone (overlaid with dark arrows in the figure), both lie on the same side of a line through the two bases (called cis) or on opposite sides of the line (called trans). The top basepair is in cis, the bottom one in trans. The basepairs in RNA helices are cis Watson-Crick / Watson-Crick basepairs. The 2001 Leontis-Westhof paper simply says about cis and trans that they “follow the usual stereochemical meanings” which you can read about in a Wikipedia article at this link. In particular, it says there that “The terms cis and trans are from Latin, in which cis means "on the same side" and trans means "on the other side" or "across".”
The Leontis-Westhof system allows for a simple 3-character description of each RNA basepair, using c or t for cis or trans, and W, H, or S for each of the interacting edges. Thus, for example, the top basepair in Figure 2 is annotated CG cWW for a basepair made between C and G in which C uses its Watson-Crick edge, G uses its Watson-Crick edge, and the basepair orientation is cis. The bottom basepair is annotated UA tWW.
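As a minimal sketch of how the 3-character annotation can be read mechanically, the Python below splits an annotation string into its parts and rewrites it from the point of view of the other nucleotide. The function names and the input format (two base letters, a space, then the 3-character code) are assumptions made for illustration; they are not taken from any particular annotation program.

```python
import re

# Letters used in the 3-character Leontis-Westhof code.
EDGES = {"W": "Watson-Crick", "H": "Hoogsteen", "S": "Sugar"}
ORIENTATIONS = {"c": "cis", "t": "trans"}

def parse_basepair(annotation):
    """Split an annotation such as 'GA tSH' into bases, edges, and orientation."""
    match = re.fullmatch(r"([ACGU])([ACGU]) ([ct])([WHS])([WHS])", annotation.strip())
    if match is None:
        raise ValueError(f"not a basepair annotation: {annotation!r}")
    base1, base2, orientation, edge1, edge2 = match.groups()
    return {
        "base1": base1, "edge1": EDGES[edge1],
        "base2": base2, "edge2": EDGES[edge2],
        "orientation": ORIENTATIONS[orientation],
    }

def reverse_basepair(annotation):
    """Describe the same pair starting from the other nucleotide.

    The edge letters must be swapped together with the bases, so that
    each edge letter stays with the base that uses it.
    """
    bases, code = annotation.strip().split()
    return f"{bases[1]}{bases[0]} {code[0]}{code[2]}{code[1]}"

print(parse_basepair("GA tSH"))
print(reverse_basepair("GA tSH"))
```

Here "GA tSH" is read as G using its Sugar edge and A using its Hoogsteen edge in trans; reversing it gives "AG tHS", which describes the same pair.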
Revisit the Watson-Crick basepairs at this link.
The triangles below represent two RNA bases, with Watson-Crick (WC, circle), Hoogsteen (H, square) and Sugar (S, triangle) edges labeled. Nt1 is short for nucleotide 1. They are duplicated on purpose.
View the 12 RNA basepairs at this link.
RNA basepairs are often made between successive bases in an RNA chain, for example in an RNA double helix or an RNA internal loop (which we will learn about in much more detail in the future). A first way to understand them is to list the covalently connected nucleotides in two columns, then draw arcs between basepaired nucleotides and write in the 3-character Leontis-Westhof annotation for the basepair, taking care to put the letters for the edges in the right order.
View the collections of nucleotides at this link, one at a time.
AC AG AU CG CU GU AA CC GG UU
The 3-character annotation of RNA basepairs is nice for printed text, like in a paragraph in an article. Projecting RNA 3D structures onto a 2-dimensional diagram is also helpful, but it can be hard to orient text on a diagram to show clearly which base is using which edge in a basepair. In 2001, Leontis and Westhof introduced a simple set of geometric symbols to represent the base edges used and the glycosidic bond orientation:
The following figure showing the Leontis-Westhof symbols appeared in a 2009 paper by Stombaugh, Zirbel, Westhof, and Leontis entitled “Frequency and isostericity of RNA base pairs,” which is available at this link.
Figure 3. Annotations and symbols for non-Watson-Crick basepairs. From “Frequency and isostericity of RNA base pairs,” which is available at this link.
Add Leontis-Westhof symbols to the basepairs that you annotated in earlier exercises.
It is rewarding but difficult to annotate RNA basepairs by eye. One helpful resource is the RNA Basepair Catalog, which shows exemplar instances of each base combination in each of the 12 basepair families. Comparing to the Catalog is a good way to check annotations done by eye, and it’s a good way to know which base combinations have never been observed in RNA 3D structures.
The BGSU RNA Group’s website maintains annotations of basepairs and other interactions in all RNA-containing 3D structures. You can see the list of basepair annotations of a tRNA at this link. That link is for PDB file 1EHZ; by changing the URL, you can see annotations for any other RNA-containing file from PDB. Click on the three-letter code to see each basepair. Note that some three-letter codes are preceded by the letter “n”. This stands for “near” and indicates that it is plausible that the bases make the indicated pair, but the coordinates lie just outside the cutoffs for that pair. Perhaps a different structure determination experiment would show a true pair there.
Nearby bases that are not in the same plane are often in parallel planes, above and below each other, so that they overlap when viewed from above. This is called base stacking. As you look at the examples to follow, ask yourself to what extent the bases are truly in parallel planes.
As with the Leontis-Westhof system for annotating basepairs, it is useful to annotate base stackings. A simple system for this purpose was introduced in the 2008 paper “FR3D: finding local and composite recurrent structural motifs in RNA 3D structures” by Sarver, Zirbel, Stombaugh, Mokdad, and Leontis, which is available at this link. The faces are named according to which direction they face in an RNA helix. The 3’ face is the one that faces toward the 3’ end of the chain (larger nucleotide numbers) while the 5’ face is the one that faces toward the 5’ end of the chain (smaller nucleotide numbers). The 3’ face is shown for each base in Figure 1 and in the left panel of Figure 2.
When the 3’ face of one base stacks on the 5’ face of another base, we annotate this with the 3-character annotation s35, and similarly with other combinations of faces. The successive bases in an RNA helix make s35 stacking interactions. Some bases make cross-strand stacking interactions, which are not s35. In the exercise below, you will annotate base stackings as s35, s53, s33, or s55.
View three successive basepairs from several RNA helices at this link.
View several pairs of bases in different stacking orientations at this link.
Once again, view two collections of RNA nucleotides at this link.
François Major of the University of Montreal led a focus group in the RNA Ontology Consortium focused on developing annotations for base stacking. One result was a symbolic system for annotating base stacks. Here are the symbols in terms of the base faces that we introduced earlier:
These symbols are described in the 2007 paper by St-Onge, Thibault, Hamel, and Major entitled “Modeling RNA tertiary structure motifs by graph-grammars” and available at this link. Here is the relevant text from the paper:
Two possible orientations of two stacked bases result in four base-stacking types: upward (>>), downward (<<), outward (<>) and inward (><). Two arrows pointing in the same direction (upward and downward) corresponds to the stacking type in the canonical A-RNA double-helix. Upward or downward is chosen depending on which base is referred first (i.e. A>>B means B is stacked upward of A, or A is stacked downward of B). The two other types are less frequent in RNAs, respectively inward (A><B; A or B is stacked inward of, respectively B or A) and outward (A<>B; A or B is stacked outward of, respectively B or A).
To further differentiate stacking from basepairing, the line connecting two stacked bases is often drawn as an elongated capital letter I, with the top and bottom of the I suggesting that the bases are in parallel planes.
Here is a useful way to remember: from the 3’ face, there is an arrow out from the base; from the 5’ face, the arrow points in, toward the base.
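The mnemonic above can be turned into a small conversion between the s35-style annotations and the arrow symbols. This is a sketch that follows the mnemonic literally, assuming the first base is drawn on the left and the second on the right; the function name is hypothetical, and the figure in the course remains the reference for the symbols.

```python
def stacking_symbol(annotation):
    """Convert 's35', 's53', 's33', or 's55' into a two-arrow symbol."""
    if len(annotation) != 3 or annotation[0] != "s" or any(f not in "35" for f in annotation[1:]):
        raise ValueError(f"not a stacking annotation: {annotation!r}")
    first_face, second_face = annotation[1], annotation[2]
    # First base (drawn on the left): an arrow out of its 3' face points
    # right, toward the second base; at the 5' face it points back in.
    left = ">" if first_face == "3" else "<"
    # Second base (drawn on the right): an arrow out of its 3' face points
    # left, toward the first base; at the 5' face it points in, to the right.
    right = "<" if second_face == "3" else ">"
    return left + right

for code in ("s35", "s53", "s33", "s55"):
    print(code, stacking_symbol(code))
```

With this reading, the helical stacking s35 becomes two arrows pointing the same way, which agrees with the statement in the quoted text that same-direction arrows correspond to stacking in the canonical double helix.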
It is very helpful to make 2-dimensional representations of RNA 3D structure. One cannot represent all of the detail of a 3D structure, but one can come close enough to gain real understanding of the 3D structure and make it easier to compare different small bits of RNA structures. Here are a few suggestions for how to lay out an RNA interaction diagram. Label the 5’ and 3’ ends of each strand and keep nucleotides from each strand together. This is especially important when sketching out a hypothetical or consensus motif that does not have nucleotide numbers on it. Put the 5’ end of one strand in the lower left corner.
Figure 4 below is a reasonable layout for the Sarcin-Ricin motif that we have been working with.
Figure 4. Layout for an interaction diagram for the Sarcin-Ricin loop.
Once again, view the two collections of RNA nucleotides at this link. Make symbolic interaction diagrams following the suggestions above. Use the layout suggested for the Sarcin-Ricin loop.
In addition to the base-base hydrogen bonds that are present in RNA basepairs and that occur between the bases and the 2’-OH group on the sugar ring in some sugar edge basepairs, RNA bases often make hydrogen bonds with one or more of the oxygen atoms in the phosphate group of the same nucleotide or a different nucleotide. Many of these interactions are base specific, and so will play a part in establishing the relationship between RNA sequence and RNA 3D structure. These “base-phosphate interactions” were studied systematically in a 2009 paper by Zirbel, Sponer, Sponer, Stombaugh, and Leontis entitled “Classification and energetics of the base-phosphate interactions in RNA” and available at this link. The phosphate oxygens are highly electro-negative, so that any hydrogen atom on an RNA base could be the “donated” hydrogen in a hydrogen bond. The typical distance between a nitrogen atom in the base and the phosphate oxygen atom is around 3 Angstroms, while the typical distance when a carbon is the hydrogen donor is around 3.5 Angstroms. The phosphate oxygen is often out of the plane with the base, but the angle made by the nitrogen/carbon atom, hydrogen, and phosphate oxygen atom is usually greater than 130 degrees.
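The distance and angle figures quoted above can be checked on atomic coordinates with a few lines of Python. This is only a sketch: the typical distances (3 and 3.5 Angstroms) and the 130 degree angle come from the paragraph above, but the 0.5 Angstrom slack and the function names are assumptions for illustration and are not the cutoffs used in the 2009 paper.

```python
import math

def angle_degrees(donor, hydrogen, acceptor):
    """Angle at the hydrogen between the donor atom and the acceptor oxygen."""
    v1 = [d - h for d, h in zip(donor, hydrogen)]
    v2 = [a - h for a, h in zip(acceptor, hydrogen)]
    cosine = sum(x * y for x, y in zip(v1, v2)) / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

def plausible_base_phosphate(donor, hydrogen, oxygen, donor_element,
                             slack=0.5, min_angle=130.0):
    """Rough geometric check for a base-phosphate hydrogen bond.

    donor_element is 'N' or 'C'; slack (Angstroms) is a hypothetical
    tolerance added to the typical donor-oxygen distance.
    """
    typical = {"N": 3.0, "C": 3.5}[donor_element]
    close_enough = math.dist(donor, oxygen) <= typical + slack
    straight_enough = angle_degrees(donor, hydrogen, oxygen) >= min_angle
    return close_enough and straight_enough

# Idealized coordinates in Angstroms: donor, hydrogen, and oxygen in a line.
print(plausible_base_phosphate((0, 0, 0), (1, 0, 0), (3, 0, 0), "N"))
```

Coordinates would normally come from a structure file that includes hydrogens; for structures without hydrogens, the hydrogen positions would first have to be modeled.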
Rather than use the base edge to describe the location of a base-phosphate interaction, the 2009 paper proposed a more detailed annotation of the location on each base where a base-phosphate interaction is made. Figure 5, which appears in the article, shows the possible locations and how they are numbered counterclockwise around the base. The numbering is done to facilitate noticing similar interactions being made by different bases. Note that the 0BPh interaction is in the same location for all bases. It is usually a self interaction, between the base and the phosphate of the same nucleotide. These self interactions are rarely annotated, but play an important role in the stability of many motifs. The 4BPh interaction is made only by G and involves two simultaneous hydrogen bonds, with one or two oxygens. Similarly, only C can make the 8BPh interaction. Note that base-phosphate interactions made by the Watson-Crick edge of G are particularly strong; these are 3BPh, 4BPh, and 5BPh.
Figure 5. Possible locations of base-phosphate interactions for each of the RNA bases. From the paper “Classification and energetics of the base-phosphate interactions in RNA” which is available at this link.
The four oxygens attached to the phosphorus of a nucleotide are called O5’, OP1, OP2, and O3’. In RNA 3D structures, if the phosphorus is part of nucleotide N, then O5’, OP1, and OP2 are part of nucleotide N as well, but O3’ is part of nucleotide N-1. In visualizations, O3’ may or may not appear together with nucleotide N. In the 3D coordinate viewers available in this course, you can hover over each atom to see the atom name.
As with basepairs, one specifies a base-phosphate interaction by listing the two bases involved. Because of the asymmetry of the interaction, it is typical to always list the nucleotide (base) which acts as a hydrogen bond donor first, then the nucleotide whose phosphate oxygen is the hydrogen bond acceptor. Thus, for example, we may have G14 5BPh U27 when G14 uses its H1 atom as a hydrogen bond donor to the phosphate group of nucleotide U27.
Starting with the 2009 paper on base-phosphate interactions, the main way to annotate these interactions is with a “barbell” consisting of an open circle with P inside, for the nucleotide whose phosphate is used, and a filled circle with the number of the base-phosphate interaction (0 to 9) in white, as below.
Figure 6. Symbol for annotating a 5BPh interaction.
View the 18 pairs of bases at this link.
Return to the 6-nucleotide RNA helices at this link.
Return to the two collections of RNA nucleotides at this link.
Read two articles by Leontis and Westhof about RNA basepairs. They are:
Keep notes as you read. Here are some things to focus on, in addition to whatever else you want to write:
The glycosidic bond, which connects the RNA base to the ribose sugar ring, allows for 360 degrees of rotation around the bond, so that the geometry of the base and the sugar could be quite variable. However, it turns out that most RNA bases have roughly the same orientation between the base and sugar, as characterized by the conformation of the ribose sugar in the standard RNA helix. This is called the anti conformation. Around 4% of all nucleotides have a completely different orientation, called syn, in which the phosphate group of the nucleotide is closer to the sugar edge. The purpose of the following exercises is simply to help you see the difference between clear examples of the two glycosidic bond conformations.
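The course text describes anti and syn qualitatively. A common way to put a number on the distinction, added here as an aside and not part of the course text, is the torsion angle about the glycosidic bond, usually called chi, defined by the atoms O4’, C1’, N9, C4 for purines and O4’, C1’, N1, C2 for pyrimidines. The sketch below computes a torsion angle from four 3D points and applies a deliberately simple split at 90 degrees; that boundary and the function names are assumptions for illustration, and real nucleotides near the boundary deserve a closer look.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion_degrees(p0, p1, p2, p3):
    """Signed torsion angle about the p1-p2 bond, between -180 and 180 degrees."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n2 = _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(_cross(b1, b2), n2)
    return math.degrees(math.atan2(y, x))

def glycosidic_conformation(chi):
    """Simplified split (an assumption): |chi| below 90 degrees is syn, otherwise anti."""
    return "syn" if abs(chi) < 90.0 else "anti"

# Synthetic points, not real atoms: the outer points are on opposite sides.
chi = torsion_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(chi, glycosidic_conformation(chi))
```

For a real nucleotide, the four points would be the coordinates of O4’, C1’, N9, C4 (for A and G) or O4’, C1’, N1, C2 (for C and U), read from the structure file.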
View some standard RNA helices at this link.
View 11 nucleotides with different glycosidic bond conformations at this link.
Return to the two collections of RNA nucleotides at this link.
View 7 instances of a new internal loop motif at this link.
This section gives an overview of the RNA-containing 3D structures available for download.
As a condition of publication, atomic-resolution 3D structures of biological molecules are deposited with the worldwide Protein Data Bank (wwPDB), which has four members: PDB in the US, PDBe in Europe, PDBj in Japan, and BMRB for NMR structures. One can visit any of these sites to explore what 3D structures are available. The Nucleic Acid Knowledgebase (NAKB) is a close partner of the BGSU RNA group, with additional search and viewing features. PDB101 is an educational site about 3D structure data, and the section Introduction to PDB Data is particularly useful for the details of 3D structure.
In order to give an overview of what structures are available, please follow along with this tour of the BGSU RNA Group’s page for Representative Sets of RNA 3D Structures.
Of particular interest in RNA are the portions between the helices and at the ends of the helices. The tutorial at this link will orient you to RNA hairpin, internal, and multi-helix junction loops. The RNA 3D Motif Atlas collects together all internal loop and hairpin loop instances from across a 4.0A representative set of RNA 3D structures and collects together instances into motif groups, with the goal that within each motif group, all instances are instances of the same motif. Due to the natural variability of RNA molecules, there is some variability in the geometry within each group, but overall, the clustering is successful. Ideally, the Motif Atlas is updated every four weeks. It is helpful to be aware of at least some of these groups, so please follow the links below.
DNA is not always copied perfectly when organisms reproduce. The two main types of errors that we consider in this course are mutations and insertion/deletions (indels). At the moment, the focus is on mutations, a change from one nucleotide in a DNA position to a different one. Among other things, this can result in a base change in a structured RNA molecule. Roughly speaking, there are three possibilities: the mutation can disrupt the structure of the RNA molecule to such a point that the offspring cannot survive or cannot reproduce, and so we do not see these mutations in living organisms. Occasionally, the mutation may confer a survival advantage to the organism, the offspring of the organism may outcompete others in its species, and so this mutation may become fixed in a population. The third possibility is that the mutation has no significant effect on the performance of the structured RNA, in which case the mutation may come to exist in the population at some level between 0% and 100%. These are called “neutral” mutations, although some might be slightly advantageous and others slightly disadvantageous. These are the most commonly-observed types of mutations.
The first step in understanding typical sequence variability in structured RNA molecules is to compare 3D structures of homologous molecules, ones that share a common ancestry. See our new tools from 2024, which are described at https://www.bgsu.edu/research/rna/web-applications/r3d-align.html. The basic idea of the alignments is to identify nucleotides in the two structures which correspond geometrically, in the sense that the local 3D neighborhood of a nucleotide in one structure matches the local 3D neighborhood of its partner in the other structure.
Start at the R3D Align Gallery of Featured Alignments at this link.
We will look more at alignments of 3D structures later. The point now is that we can directly examine the structures and see (with more work) that in the vast majority of cases, basepairing family is conserved between 3D structures of homologous molecules from different organisms.
A well-known example of neutral mutations in RNA basepairs is the covariation of bases in cis Watson-Crick / Watson-Crick basepairs in RNA helices. Looking at a multiple sequence alignment of RNA homologous sequences (same molecule, different organisms, but where the molecule first arose from a common ancestor), one will often observe changes between the “canonical” base combinations AU, UA, CG, and GC in two columns whose nucleotides make a cWW basepair. It is important to note that, because the two nucleotides making a cWW basepair will tend to be at least 6 positions away from one another in the RNA sequence (enough sequence length to form a hairpin loop), and are often much further away in sequence, it cannot be assumed that both mutations happen at the same time, but rather that one happens and is tolerated, and that in a later generation, the compensatory mutation occurs. Thus, cWW covariation has implicit in it the possibility of non-matched basepairs occurring in RNA helices (these are often called “mismatches,” but that is probably a good word to avoid as it does not really explain what is happening; a “mismatch” could form an important non-canonical basepair), at least for a number of generations.
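To make the idea of covariation concrete, the sketch below tallies the base combinations found in two columns of a multiple sequence alignment. The five short sequences are made up for illustration (they are not from any real alignment), and the function names are hypothetical.

```python
from collections import Counter

# The four canonical base combinations named above.
CANONICAL = {"AU", "UA", "CG", "GC"}

def pair_counts(alignment, col_i, col_j):
    """Count the base combinations seen in two alignment columns."""
    return Counter(seq[col_i] + seq[col_j] for seq in alignment)

def canonical_fraction(counts):
    """Fraction of sequences whose two columns hold a canonical combination."""
    total = sum(counts.values())
    return sum(n for pair, n in counts.items() if pair in CANONICAL) / total

# Toy alignment: the first and last columns are assumed to make a cWW basepair.
toy_alignment = [
    "GCGAAAGC",
    "ACGAAAGU",
    "CCGAAAGG",
    "GCGAAAGU",
    "GCGAAAGC",
]

counts = pair_counts(toy_alignment, 0, 7)
print(counts)
print(canonical_fraction(counts))
```

In this toy example the two columns switch between GC, AU, and CG, which is the covariation pattern described above, and one sequence has GU, a combination of the kind one would expect to see while only one of the two positions has changed.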
The purpose of this section is to explore the basis for Watson-Crick covariation and to extend the observations we make there to other RNA basepairs, in order to make predictions about likely and unlikely sequence variability in structured RNA molecules.
One good place to start examining the geometric similarities between RNA basepairs is to have “exemplar” instances of each base combination in each basepairing family. To this end, all instances of each base combination for each basepairing family were extracted from a non-redundant set of RNA-containing 3D structure files determined by x-ray crystallography with resolution 4.0 Angstroms or better. (As of 2013-11-25, structures with extension .pdb1 were excluded from this analysis.) Geometric discrepancies were calculated between each pair of basepairs, as described in the 2008 article “FR3D: finding local and composite recurrent structural motifs in RNA 3D structures” by Michael Sarver, Craig L. Zirbel, Jesse Stombaugh, Ali Mokdad, Neocles B. Leontis, which is available at this link. Basepairs whose bases are “coplanar” as defined in the 2011 article “Comprehensive survey and geometric classification of base triples in RNA structures” by Amal S. Abu Almakarem, Anton I. Petrov, Jesse Stombaugh, Craig L. Zirbel and Neocles B. Leontis, which is available at this link, were separated from those which are “non-coplanar”. If there is at least one coplanar basepair, the coplanar instance which minimizes the product of the reported resolution of the structure, the sum of discrepancies to all other instances, and the numerical rank of the candidate when listed by the sum of discrepancies to all other instances, is chosen as the exemplar. If there is no coplanar instance, the same ranking is applied to the non-coplanar instances, or else a curated instance is used.
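The selection rule just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used to build the catalog: the data layout and function name are hypothetical, the sum of discrepancies is taken over all other instances (coplanar or not), the rank is taken among the candidates being considered, and the curated fallback is not modeled. The resolutions and discrepancies are made-up numbers.

```python
def choose_exemplar(instances, discrepancy):
    """Pick the instance minimizing resolution * discrepancy sum * rank.

    instances: list of dicts with keys 'id', 'resolution', 'coplanar'.
    discrepancy: dict mapping frozenset({id1, id2}) to a geometric discrepancy.
    """
    coplanar = [inst for inst in instances if inst["coplanar"]]
    # Prefer coplanar instances; otherwise fall back to whatever is available.
    pool = coplanar if coplanar else instances

    def total(inst):
        return sum(discrepancy[frozenset((inst["id"], other["id"]))]
                   for other in instances if other["id"] != inst["id"])

    ordered = sorted(pool, key=total)  # rank 1 = smallest discrepancy sum
    scored = [(inst["resolution"] * total(inst) * rank, inst["id"])
              for rank, inst in enumerate(ordered, start=1)]
    return min(scored)[1]

# Made-up example with three instances of one base combination.
instances = [
    {"id": "a", "resolution": 2.0, "coplanar": True},
    {"id": "b", "resolution": 3.5, "coplanar": True},
    {"id": "c", "resolution": 1.5, "coplanar": True},
]
discrepancy = {
    frozenset(("a", "b")): 0.2,
    frozenset(("a", "c")): 0.4,
    frozenset(("b", "c")): 0.3,
}
print(choose_exemplar(instances, discrepancy))
```

In the made-up example, instance b is chosen even though it has the worst resolution, because it is the most central instance (smallest discrepancy sum and rank 1); multiplying the three factors lets centrality and resolution trade off against each other.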
The exemplar basepairs are displayed in the RNA Basepair Catalog hosted at the Nucleic Acid Knowledgebase; it is available at this link. For each basepairing family, instances of each known base combination making the basepair are shown. This is a very useful reference to be able to see the best known instances of each base combination in each of the 12 basepairing families.
The basepair exemplars from each geometric basepairing family are shown in a single PDF, available at this link. The families are listed in the same order as in the 2002 paper by Leontis, Stombaugh, and Westhof. Thus, cWW starts on page 1, tWW on page 4, cWH on page 7, tWH on page 10, cWS on page 13, tWS on page 16, cHH on page 19, tHH on page 22, cHS on page 25, tHS on page 28, cSS on page 31, and tSS on page 34. Above each basepair is a title which indicates the base combination, the PDB ID of the 3D structure it is taken from, the nucleotide numbers within that structure (but not the chain), the distance in Angstroms between the C1’ atoms of the two nucleotides, and the number of instances of this base combination in the non-redundant set from which the exemplars were drawn.
Some notes about the basepairs.
On page 1, take a moment to compare the glycosidic bonds on the AU, CG, GC, and UA cWW basepairs. These appear diagonally from upper right to lower left. The glycosidic bond of the first nucleotide is always vertical, so focus on the relative location and orientation of the second one. You can see that the orientation (angle) is roughly the same in each case. Also, note from the title above each instance that the C1’-C1’ distances are nearly the same, 10.6 Angstroms for GC and CG, and 10.4 Angstroms for UA and AU. To really be able to focus on the relative orientations of the glycosidic bonds, go to a later page which shows just the glycosidic bonds for the base combinations in the cWW family. The glycosidic bonds of the first base are all superimposed in the lower left, so that you can compare how different the relative positions of the glycosidic bonds of the second bases are. Note how close together in the plane the second glycosidic bonds of the AU, UA, GC, and CG base combinations are. Note that their orientations (angle with respect to horizontal) are nearly identical. As far as we know, this is one of the reasons that one frequently sees substitutions between AU, UA, GC and CG in RNA helices: substituting one base combination for another need not disturb the RNA backbone at all.
Now find the glycosidic bonds of the GU and UG cWW basepairs. They are at a considerable distance from the AU, UA, CG, and GC bonds, although they have the same orientation. Note, in particular, that the glycosidic bonds from GU and UG are really far apart from each other. This helps explain why one sees a GU or a UG substituting for a canonical cWW basepair in an RNA helix only occasionally. But there are other locations where, for some reason, a UG or a GU cWW is preferred, and there one does not often see the base combination in the other order.
In the 2009 paper, “Frequency and isostericity of RNA base pairs” by Jesse Stombaugh, Craig L. Zirbel, Eric Westhof, and Neocles B. Leontis which is available at this link, a quantitative measure of basepair isostericity was defined and called “IsoDiscrepancy Index” or IDI. The idea is to make a score that increases the more dissimilar the glycosidic bonds are. The images below are reproduced from the 2009 paper and show the three contributions.
These three measurements are combined as explained in the 2009 paper to calculate the IDI between different instances of RNA basepairs. The exact details need not concern us here.
The same PDF that shows instances of each basepair also shows calculated values of the IsoDiscrepancy Index between exemplars, in a colorful matrix. The cells in the matrix are colored according to the IDI value, using the scale at the right of the matrix. The numerical value also appears in each cell. Base combinations are ordered the same in the rows as in the columns, and the ordering is chosen to put similar base combinations near each other in the list, so as to put more of the small IDI values near the diagonal.
Look at the IDI matrix for the cWW family and confirm that the mutually isosteric subgroups match your expectations from the superpositions of glycosidic bonds and the basepair exemplars themselves.
RNA molecules are made by transcription from DNA, which joins together the four RNA nucleotides A, C, G, and U in the order given by the DNA strand that is being transcribed. Transcription is said to go in 5’ to 3’ order; these numbers refer to the O5’ and O3’ oxygen atoms on the RNA backbone and this gives a standard sense of direction along the RNA chain. In RNA 3D structures, the nucleotides are always numbered in increasing order, following 5’ to 3’ order. The sequence of nucleotides is called the primary structure. For most RNA molecules, the primary structure is all that is known from direct experimental evidence, usually genome sequencing.
Even while the rest of the RNA molecule is being transcribed, the nucleotides that are already transcribed are moved about randomly by the molecular motion of water in the cell and occasionally two strands will come together and form a sequence of cWW basepairs, which will then form themselves into an RNA helix. We have seen portions of RNA helices in earlier exercises and you can see them again at this link. By far the most common base combination to form cWW basepairs is GC (7478 instances), followed by AU (2641 instances), with GU a distant third place (796 instances). GC and AU base combinations are often called canonical while GU is called wobble. In structured RNA molecules, RNA helices typically consist of between 2 and 10 stacked cWW basepairs. The list of GC, AU, and GU Watson-Crick basepairs in RNA helices is called the secondary structure of an RNA. (Pro tip: in some structures, some Watson-Crick basepairs “cross” others, forming pseudoknots. That is a subject for a later discussion.)
Re-visit the short RNA helices at this link.
Each RNA-containing 3D structure has a web page on the BGSU RNA Group website. Each such page includes a viewer for the secondary structure. A key feature of the viewer is that after selecting portions of the secondary structure, a window opens to display the corresponding nucleotides in 3D.
One of the most common and most often studied structured RNAs is tRNA, or transfer RNA. tRNAs help in the process of protein synthesis, called translation, by bringing the correct amino acid to the right place on the ribosome at the right time. Many tRNA 3D structures have been solved, but they differ from one another in small ways.
Open the 2D diagram for the tRNA molecule in PDB file 2CV1 at this link.
Transfer RNA has a hairpin loop known as the T-loop, probably because in tRNA it contains a modified nucleotide (ribothymidine, abbreviated T), but it also fits that the T-loop occurs in tRNA. The same type of hairpin also occurs in ribosomal RNA, some riboswitches, and other structured RNAs.
We have examined one instance of the sarcin-ricin motif many times already in this course, because it is a common motif that has a variety of non-Watson-Crick basepairs, stacking arrangements, and base-phosphate interactions. It is the second motif at this link. That particular instance comes from PDB file 1S72, as you could find out by viewing the page source and looking toward the bottom for the Unit IDs of the nucleotides being shown. The first Unit ID is 1S72_AU_1_9_75_G_, which means that the nucleotide comes from PDB file 1S72, from the “asymmetric unit”, from model 1, from chain 9, and is residue 75, which is a G for guanine.
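Since the fields of a Unit ID are separated by underscores, it is easy to take one apart in Octave. The sketch below splits the Unit ID quoted above into the fields described in the text; the variable names are our own choice, and the chain is kept as text because chain identifiers in other structures are often letters.

```octave
unitID = '1S72_AU_1_9_75_G_';

fields = strsplit(unitID, '_');      % split the Unit ID at each underscore

pdbID    = fields{1};                % PDB file, here 1S72
unitType = fields{2};                % AU stands for asymmetric unit
model    = str2double(fields{3});    % model number
chain    = fields{4};                % chain identifier, kept as text
resNum   = str2double(fields{5});    % residue number
base     = fields{6};                % residue type, here G for guanine

fprintf('PDB %s, model %d, chain %s, residue %s%d\n', pdbID, model, chain, base, resNum);
```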
One can read about the 3D structure file 1S72 by simply doing a web search for 1S72, which should quickly lead to the PDB (Protein Data Bank) entry for 1S72 at this link. Take a few minutes to look through that page to get a sense for what is provided there.
The 1S72 structure was solved by a group at Yale University including Tom Steitz, who shared the 2009 Nobel prize in Chemistry for making high-resolution 3D structures of the Haloarcula marismortui (domain archaea) large ribosomal subunit. It really is a well-modeled 3D structure, with very nice basepairs.
The specific instance of the sarcin-ricin loop in chain 9 of 1S72 is an internal loop with nucleotide numbers 75 to 81 on one strand and 101 to 106 on the other strand. Other instances of internal loops with the same overall geometry and basepairing interactions also exist. There are three ways to find which “motif group” contains this instance from 1S72. Let’s try them all.
The goal of this section is to explore the structural and sequence variability of internal loops that are similar to the sarcin-ricin instance from chain 9 of 1S72, H.m. 5S rRNA, that we have been studying in earlier exercises. As of June 29, 2014, the current version of this motif group is available at this link. The instances collected here come from a variety of RNA 3D structures. They all share the same basic geometry and basepairing interactions (with a few possible exceptions that we’ll discuss below).
To this point, we have been looking at instances of RNA motifs from 3D structures solved at atomic resolution. This has been done for a very small number of organisms. In the case of ribosomal RNAs, we have 3D structures from three bacteria (E. coli, Thermus thermophilus, and Deinococcus radiodurans), one archaeon (Haloarcula marismortui), and two eukaryotes (yeast and Tetrahymena thermophila). However, ribosomal RNA sequences alone have been determined for hundreds of thousands of organisms. They vary in the ways that we noted above, with different bases in certain locations and with insertions and deletions of bases relative to one another. Even so, research groups have determined good alignments between these sequences, which means that they have determined which nucleotide positions in one organism correspond to positions in other organisms.
We can learn a lot about RNA 3D structure by studying the sequences from other organisms that are aligned to the nucleotides corresponding to an RNA motif whose 3D structure is known. Below is a first example of this.
Throughout the course we have been looking at instance IL_1S72_103 of the sarcin-ricin internal loop from the archaeal 5S rRNA. Unfortunately, I don’t have a nice alignment of archaeal 5S rRNA sequences. Instead, we’ll look at sequence variants of another instance that has the same sequence except for the flanking cWW basepairs, namely IL_3V2F_007 from Thermus thermophilus, a bacterium that likes to live at 65 Celsius and was first identified in a hot spring in Japan. The following text will orient you to the alignment.
Here are some tasks to work on with this small alignment.
The question being addressed here is: if we look at sequence variants from sequence alignments corresponding to motif instances at non-homologous positions, how much sequence variability do we see? This will give us some idea how much sequence variability there is in RNA 3D motifs.
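One simple way to quantify sequence variability is to tabulate how often each distinct sequence occurs among the variants. The sketch below does this in Octave for a short, made-up list of sequences; the sequences are hypothetical and are not taken from any real alignment, and the * marks the break between the two strands.

```octave
% Hypothetical sequence variants of a small motif (strand break shown as *).
seqs = {'GUA*GA', 'GUA*GA', 'GUA*AA', 'GUA*GA', 'CUA*GA'};

[variants, ~, idx] = unique(seqs);     % distinct sequences, in sorted order
counts = accumarray(idx(:), 1);        % number of occurrences of each variant

[~, mostCommon] = max(counts);         % index of the most frequent variant
numDifferent = numel(seqs) - counts(mostCommon);

for k = 1:numel(variants)
  fprintf('%s  %d\n', variants{k}, counts(k));
end
fprintf('%d of %d sequences differ from %s\n', ...
        numDifferent, numel(seqs), variants{mostCommon});
```

The same tabulation can be applied column by column to see which positions in a motif vary and which are conserved.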
JAR3D stands for Java-based Alignment of RNA using 3D structure. JAR3D is usually pronounced Jared. The goal of JAR3D is to take as input one or more sequences of an RNA internal loop or hairpin loop and match it to all known internal and hairpin loop motif groups to find the 3D motif that is the best match for the given sequence(s). In this way, it is possible to infer the 3D geometry of an internal or hairpin loop from its sequence. The ideal case is when an exact sequence match can be found, but that is rare, so it is necessary to have a system for inexact sequence matches, and that is what JAR3D provides, using basepair isostericity to provide the matches. The JAR3D server is available at this link.
See the tutorial on JAR3D at this link. See additional tutorials at this link.
FR3D stands for “Find RNA 3D” and is usually pronounced Fred. It is a collection of programs written in Matlab which read RNA 3D structure files in PDB format, annotate basepairs and other interactions, and allow for a variety of searches to be conducted. The purpose of this part of the course is to become adept at constructing searches and understanding the results, with the further purpose of understanding RNA tertiary interactions.
Although FR3D was written in Matlab, some people don’t have access to Matlab because it costs money. One can use GNU Octave to run the programs instead. GNU Octave is free, which is good, and it runs most Matlab commands, but it does not have the same graphical user interfaces that make FR3D work so smoothly on Matlab. Nevertheless, it is possible to get FR3D to work well enough on Octave. There is an installer for Octave on Windows, at this link. Also, note that different Octave installations seem to have different graphics interfaces. From what I have seen, Octave FR3D works well on Linux platforms.
I have prepared a zip file that contains the FR3D programs, some binary data files, some canned searches for us to look at, and some circular basepair diagrams, which will be explained in due course. To download FR3D, begin by downloading the zip file, which is available at this link. Save it in a place that you have access to and unzip it. This will create a folder called FR3D, which has a number of sub-folders.
Start Octave and set the working directory to the FR3D folder using the cd command to change directories. You will probably find it helpful to use the pwd command to print the current working directory and the ls command to see the contents of the current working directory, until you find your way to the FR3D directory. If you need to go up one directory level, type cd .. (cd space two periods).
Tell Octave where the FR3D files are by typing oSetPath at the Octave prompt. This will run the program oSetPath, which will add the folders FR3DSource, PrecomputedData, SearchSaveFiles, and PDBFiles to Octave’s search path. The letter o at the beginning of program names means that it was written specifically for use with Octave.
Load a saved search by running the program oLoad at the command prompt. oLoad reads the filenames of saved searches from the folder SearchSaveFiles, numbers them, and prints them to the screen so that you need only type the number of the search you want to see. Let’s load the search called “All AG basepairs” which consists of all AG basepairs in 3D structure file 1S72. oLoad loads the save file and runs oDisplay to display the search results on the screen. If oDisplay ever crashes, you can restart it by typing oDisplay at the Octave prompt.
oDisplay is a menu-based program that shows one candidate at a time from the search that was loaded. “Candidate” is the generic word for the RNA fragments found by a FR3D search. This word is used because not every fragment that is found will really be of interest; they are merely candidates for being interesting. oDisplay continually shows the menu in the main display window so the user can enter a numerical choice. Entering 1 or just hitting Enter will advance to the next candidate. Here are all of the menu choices with brief explanations:
1 Next candidate (or just press Enter) | Advance to the next candidate in the current ordering |
2 Previous Candidate | Return to the previous candidate |
3 Add plot | Additional window to show candidates in |
4 Append output to ... | Send screen output to the named file, which will be saved in the folder SearchSaveFiles. This makes it possible to view very wide output using a text editor which has a horizontal scroll bar. |
5 Visualization options | Change visualization options such as nucleotide coloring and how much of the neighborhood of the candidate to display |
6 Jump to candidate | Specify a candidate to display by PDB ID or other text identifier |
7 Mark/Unmark current | Individual candidates can be marked for further analysis |
8 Reverse all marks | So you can mark the ones you don’t want, instead |
9 Display marked only | Show only the marked candidates |
10 List to screen | List candidates and their pairwise interactions. If Octave does not have a horizontal scroll bar, it may be necessary to copy and paste the output into an editor that can view lines without wrapping them. In class, use Applications, Programming, Nedit. |
11 Write to PDB | Write out the current candidates in PDB files |
12 Sort by centrality | Order candidates, starting with the centroid; displays a heat map of discrepancies between candidates |
13 Order by Similarity | To the extent possible, put similar candidates near one another; displays a heat map of discrepancies between candidates |
14 Show Alignment | Show a sequence-oriented alignment of the candidates |
15 Show Scatterplot | Make scatterplots of pairwise base orientations |
16 Navigate with Fig 99 | Use a heat map to display the discrepancies between candidates, useful for navigation |
17 Rotate 20 degrees | The Matlab/Octave 3D rotation program can only rotate about 2 axes, making it hard to achieve certain views. This option will rotate 20 degrees about the third axis. |
18 Quit display | Quit the display program |
For each candidate, oDisplay gives a description of the pairwise interactions in the current candidate, then shows the menu again. The 3D coordinates of candidates are shown in Figure 1. Arrange the windows so that you can see the main window and Figure 1, and press Enter several times and look at the different candidates. Rotate the 3D coordinates to put the A in a standard position, with the Watson-Crick edge to the right and the glycosidic bond vertical. (To rotate on the computers at ITC in Vienna, click the R tab at the bottom of the Figure 1 window.)
Earlier, we looked at the circular interaction diagram for the Haloarcula marismortui large ribosomal subunit from PDB file 1S72. The circular diagrams are prepared by FR3D and are available as PDF files in a folder in the large zip file, see above. The nucleotides in the structure are arranged around the outside of a circle, clockwise, with chains and round numbers labeled. Circular arcs are drawn through the center of the circle to indicate pairwise interactions. The colors of the arcs indicate the type of interaction. Dark blue arcs indicate nested cis Watson-Crick / Watson-Crick basepairs. These are the kinds that you normally see represented on a secondary structure diagram, but also include non-canonical base combinations such as AG or UU. The circular arcs in a given helix are nested inside one another, like Russian matryoshka dolls. In particular, the blue arcs do not cross each other. There are additional cWW basepairs which cross over the blue ones; these are called non-nested cWW pairs and are colored red. Such basepairs usually come in a set, which geometrically twist into a double helix like other RNA helices. In fact, one could just as well label these as being “nested” and the cWW arcs they cross as being “non-nested”, it’s a matter of convention. A good example of the arbitrary nature of the labeling can be found in 1S72, looking at the helices made by 1449-1450 and 1511-1512 as opposed to 1451-1453 and 1674-1676. In these diagrams, the cWW basepairs closest to hairpin loops are labeled first as being nested, and moving out from the hairpins, once a cWW crosses a previously-labeled cWW arc, it is called non-nested. Non-nested cWW pairs are usually said to form pseudoknots, and these are usually not shown on secondary structure diagrams. It is possible to infer the presence of Watson-Crick pseudoknots by sequence covariation, just like with nested cWW helices. Prediction of pseudoknots from RNA folding is, however, somewhat harder.
In many places in the circular diagrams, you will see two or more helices (blue arcs) next to each other but nested within a larger helix. These represent multi-helix junctions. The degree of the junction can be counted by counting the inner helices plus the outer helix. In particular, chain 9 of 1S72 is the 5S rRNA, which has a 3-way junction.
For each arc on a circular diagram, it is helpful to count the number of nested cWW arcs that it crosses, since this gives an indication of whether the interaction is local to the secondary structure (such as the basepairs in an internal loop) or whether it is long range relative to the secondary structure. Non-Watson-Crick basepairs which do not cross any nested cWW pairs are said to be nested and are colored cyan (light blue). Non-Watson-Crick basepairs that cross one or more nested cWW pairs are colored green. On most circular diagrams, there are far more green arcs than red, suggesting that non-nested non-Watson-Crick basepairs play a bigger role in determining the 3D structure of a structured RNA molecule than Watson-Crick pseudoknots do.
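To make the idea of counting crossed arcs concrete, here is a minimal Octave sketch. It assumes that nucleotides are identified by their position along a single chain and that the nested cWW pairs are already known; the list of pairs is hypothetical, and the way FR3D itself computes the crossing number may differ in its details.

```octave
% Hypothetical nested cWW basepairs, one per row, as positions along the chain.
nestedcWW = [1 20; 2 19; 3 18; 6 15; 7 14; 25 40; 26 39];
p = nestedcWW(:, 1);
q = nestedcWW(:, 2);

% The arc (a,b), with a < b, crosses the arc (p,q) when exactly one of p and q
% lies strictly between a and b. Arcs sharing a nucleotide with (a,b) are skipped.
crossing = @(a, b) sum(xor(p > a & p < b, q > a & q < b) ...
                       & p ~= a & p ~= b & q ~= a & q ~= b);

fprintf('Interaction  9-12 crosses %d nested cWW arcs\n', crossing(9, 12));
fprintf('Interaction  4-10 crosses %d nested cWW arcs\n', crossing(4, 10));
fprintf('Interaction 10-30 crosses %d nested cWW arcs\n', crossing(10, 30));
```

The first interaction lies inside a hairpin loop and is local, the second reaches across one short helix, and the third is long range relative to the secondary structure.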
Base stacking interactions are colored yellow, base-phosphate interactions are colored purple, and base-ribose interactions are colored orange. These are drawn before the basepairs, so they may be covered over by basepairs, although they start from slightly different places, so they should not be covered entirely. Note that there are many long-range interactions of these three types, and that they also occur in multi-helix junctions. In FR3D searches for two nucleotide motifs, you will see a crossing number associated with these interactions as well.
Use Octave to load the search for AG basepairs. Order by similarity and then list the basepairs to the screen. Note the crossing number at the far right of each line, and make particular note for each basepair type whether it is typically nested or non-nested.
When using Octave to do FR3D searches, one needs to use Matlab-style commands to set various variables that define the parameters of a search. So in addition to explaining what types of searches can be done, we will need to learn how to set the search parameters with the right syntax. We will alternate between explaining a search and having you construct your own search. I recommend that you open a text file to save the text of your searches, and that you save them as separate blocks of code, clearly labeled. If you name your file mysearches.m and save it in the FR3D folder, then you can define the first search in the file by running mysearches.m from the Octave command line, which you do by typing mysearches and hitting Enter. Or you can copy and paste text into the Octave window. In the computer lab near the TBI in Vienna, I recommend using Nedit, which you can get to with Applications, Programming, Nedit.
The following text sets up the search for AG tHS basepairs.
% AG tHS basepairs
clear Query % remove any previous Query parameters
Query.Name = 'AG tHS basepairs';
Query.Edges{1,2} = 'AG tHS';
Query.SearchFiles = {'Ribosome_list'};
oFR3DSearch
return % stop execution of the current program
Note that the % character begins a comment in Matlab; the % and text after it on the same line will be ignored. The second line clears the variable Query, so that parameters from previous searches don’t accidentally become a part of this search. Query.Name is text that will appear in the binary file where the results of the search will be saved, and which you will be able to load later, so take some care to choose it appropriately. Query.Edges is a cell array (different from a matrix of numbers) used to specify basepairs (in terms of Watson-Crick, Hoogsteen, and Sugar edges) and also base stacking. It can also be used to specify base combinations of interest. Order matters very much; Query.Edges{1,2} = 'AG tHS' means that nucleotide 1 must be A, nucleotide 2 must be G, and the Hoogsteen edge of nucleotide 1 must pair with the Sugar edge of nucleotide 2 in a trans Hoogsteen/Sugar basepair. Note the curly braces { and }. Query.SearchFiles is also a cell array telling which 3D structure files should be searched. 'Ribosome_list' refers to a text file named ribosome_list.pdb which appears in the PDBFiles folder in the FR3D folder, and which lists the PDB IDs of 10 ribosomal 3D structures. One could name them individually, this way:
Query.SearchFiles = {'3U5H','4A1B','2QBG','3V2F','1S72','4IOA','3U5F','4BPP','2AW7','1FJG'};
But that is not as clear to some people!
Copy and modify the code above to find all GA basepairs in which G uses its Sugar edge. Search just the E. coli ribosome 3D structure files 2QBG and 2AW7. Note that in order to get additional basepairs beyond tSH, add them to Query.Edges{1,2} separated by spaces, which is interpreted as logical or:
Query.Edges{1,2} = 'GA tSH cSH …….'
(you fill in the … part). How many basepairs do you find? Order them by similarity and look at Figure 99. How many distinct clusters do you see? List the candidates. Do you see the same clusters in that list as in Figure 99?
When designing a FR3D search, it is very helpful to first draw a diagram of the nucleotides you are looking for, so that you can number them and write out the basepairs consistently. Here is a diagram for the AG tHS search:
The following text sets up a symbolic search for the core of the Sarcin-Ricin loop. It is a 5-nucleotide motif.
clear Query % remove any previous Query parameters
Query.Name = 'Sarcin five nucleotide symbolic';
Query.Edges{1,2} = 'cSH';
Query.Edges{3,4} = 'tHS';
Query.Edges{2,5} = 'tWH';
Query.Diff{2,1} = '> =1';
Query.Diff{3,2} = '> =1';
Query.Diff{5,4} = '> <=5';
Query.SearchFiles = {'Ribosome_list'};
oFR3DSearch
return
Copy these commands into Octave. While it is running, reproduce the diagram below and annotate it to show the basepairs in the motif definition.
The Query.Diff entries tell about the backbone chain connectivity between different nucleotides. They can be read as follows:
Nucleotide 2 has a higher number than nucleotide 1, and the difference is exactly 1
Nucleotide 3 has a higher number than nucleotide 2, and the difference is exactly 1
Nucleotide 5 has a higher number than nucleotide 4, and the difference is no more than 5
Note that it is a good idea to number the nucleotides in a motif diagram so that they reflect the 5’ to 3’ ordering that you want to get.
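To check your reading of the Query.Diff entries, it can help to write the three constraints out as ordinary comparisons. The sketch below is not FR3D code; it is a small Octave test, with a function name of our own choosing, that takes the five nucleotide numbers of a candidate (assumed to lie in the same chain) and reports whether they satisfy the constraints as read above.

```octave
% nt(k) is the nucleotide number of nucleotide k in the motif, k = 1..5.
satisfiesDiff = @(nt) (nt(2) - nt(1) == 1) ...        % Diff{2,1} = '> =1'
                   && (nt(3) - nt(2) == 1) ...        % Diff{3,2} = '> =1'
                   && (nt(5) > nt(4)) ...             % Diff{5,4} = '> <=5'
                   && (nt(5) - nt(4) <= 5);

okCandidate  = satisfiesDiff([175 176 177 159 160]);  % all three constraints hold
badAdjacency = satisfiesDiff([175 177 178 159 160]);  % nucleotides 1 and 2 not adjacent
badOrder     = satisfiesDiff([175 176 177 160 159]);  % nucleotide 5 comes before 4

fprintf('%d %d %d\n', okCandidate, badAdjacency, badOrder);
```

Note that no constraint relates nucleotides 3 and 4, so nucleotide 4 may have a lower or a higher number than nucleotide 3.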
When the search is done, order by similarity and view Figure 99. If you look hard, you can see small differences between the candidates, in both sequence and geometry. How many candidates do not share the most common sequence GUA*GA?
List the candidates to the screen with menu option 10. Unfortunately, the terminal window that Octave runs in wraps the lines, which are much too wide. In Vienna in the computer lab near the TBI, you can start a text editor by doing Applications, Programming, Nedit. Copy and paste the candidate list into this editor, and turn off line wrapping to view it.
Compare the nucleotide numbers of nucleotides 3 and 4. Does nucleotide 3 always have a lower (or higher) nucleotide number than nucleotide 4? What does this mean?
After the columns listing basepairs and after the column listing glycosidic bond orientation, the columns list the distances between nucleotides along the nucleotide chain. Are there any instances where nucleotides 4 and 5 are not adjacent in the chain?
There appear to be conserved base-phosphate and base-ribose interactions between some nucleotides. Find these interactions in one instance and then check to see whether they are conserved in other instances.
Filename Number Nucl 1 2 3 4 5 Chain
1S72 1 G 175 U 176 A 177 G 159 A 160 00000
1S72 2 G 213 U 214 A 215 G 225 A 226 00000
1S72 3 G 358 U 359 A 360 G 292 A 293 00000
1S72 4 G 381 U 382 A 383 G 406 A 407 00000
1S72 5 G 464 U 465 A 466 G 475 A 476 00000
1S72 6 G 588 U 589 A 590 G 568 A 569 00000
1S72 7 G 953 U 954 A 955 A 1012 A 1013 00000
1S72 8 G 1292 U 1293 A 1294 G 911 A 912 00000
1S72 9 G 1370 U 1371 A 1372 G 2053 A 2054 00000
1S72 10 G 1971 U 1972 A 1973 G 2009 A 2010 00000
1S72 11 G 2692 U 2693 A 2694 G 2701 A 2702 00000
1S72 12 G 78 U 79 A 80 G 102 A 103 99999
Use this motif diagram to define a FR3D search and run the search on 1S72. Order the candidates by similarity and comment on what you see there.
The following figures were prepared for the 2011 article WebFR3D—a server for finding, aligning and analyzing recurrent RNA 3D motifs by Anton I. Petrov, Craig L. Zirbel, and Neocles B. Leontis, which is available at this link. They are keyed to the matrix-oriented input screen (panel a) that one finds with FR3D on Matlab and on the WebFR3D server. For our purposes, the interaction constraints in panel b are set in the Query.Edges variable, the sequential distance constraints in panel c are set in the Query.Diff variable, and the nucleotide identity constraints in panel d are set in the Query.Mask variable.
RNA hairpin loops occur at the end of a helix. In order to find them, it is useful to be able to search for the last canonical cWW basepair (AU, GC, or GU) before the hairpin. The two nucleotides in this basepair have two things going for them: they make a canonical cWW basepair and between them is a non-empty “single-stranded” part of the chain whose nucleotides make no additional nested canonical cWW basepairs. We say that these nucleotides border or delimit a single-stranded region. The following search finds such pairs:
clear Query % remove any previous Query parameters
Query.Name = 'cWW Border SS';
Query.Edges{1,2} = 'cWW borderSS';
Query.Diff{2,1} = '>';
Query.SearchFiles = {'1S72'};
oFR3DSearch
return
This search returns 72 instances from 1S72. What does that tell us about the structure? Why is it useful to put the > constraint between nucleotides 2 and 1?
The previous search only finds the flanking cWW basepair of hairpin loops, which does not tell us much about the structure of the hairpin. Let’s take a step in this direction by writing a search for the two nucleotides making the flanking cWW pair and the two nucleotides that are adjacent to those, inside the hairpin. Use the following diagram to design the search, making 1 and 4 be the cWW pair and satisfy the borderSS condition, and making nucleotide 2 have a higher nucleotide number than 1 but be adjacent in the chain, and making 3 have a lower nucleotide number than 4 but be adjacent in the chain. Don’t put any restriction on the number of nucleotides between 2 and 3.
Once you have the search designed, code it and run it with FR3D. Order the candidates by similarity. What lessons can you learn from these hairpins?
Internal loops are more complicated than hairpin loops because they are made by two single-stranded regions and they have two closing canonical Watson-Crick basepairs. Draw a diagram for a 4-nucleotide query which will find all internal loops in 1S72 which have at least one nucleotide in each single-stranded region. Be sure to think about the cWW pairs, the borderSS relation, chain directionality, and anything else.
Type up this query and run it with FR3D. How many such internal loops are present in 1S72?
To this point, all of the searches we have done were made up of symbolic constraints, that is, restrictions on basepairs and chain connectivity, to which we could add base stacking and base-backbone interactions using the different codes in the charts above. Now we turn our attention to geometric searches, where we start with a known instance and would like to find other instances of a similar motif. As we will see, one can add symbolic constraints to a geometric search in order to focus it and speed it up.
There are a few common classes of hairpin loops in structured RNA molecules, and the GNRA loop is the most common. The string GNRA tells the “consensus” sequence of the hairpin, which is G followed by N (anything) followed by R (A or G) followed by A. Most instances of GNRA hairpins fit this pattern, but certainly not all. Here are the parameters for a geometric search for GNRA hairpins in 1S72:
clear Query
Query.Name = 'GNRA hairpin geometric search';
Query.Filename = '1S72'; % file containing query instance
Query.NTList = {'804' '805' '806' '807' '808' '809'};
% nucleotide numbers of query
Query.ChainList = {'0' '0' '0' '0' '0' '0'};
% chains of query (optional)
Query.DiscCutoff = 0.5; % limit on geometric discrepancy
Query.Diff{2,1} = '>';
Query.Diff{3,2} = '>';
Query.Diff{4,3} = '>';
Query.Diff{5,4} = '>';
Query.Diff{6,5} = '>';
Query.ExcludeOverlap = 1; % exclude very similar candidates
Query.SearchFiles = {'1S72'};
oFR3DSearch
Let’s go through the lines which are new in this query. Query.Filename tells which PDB file contains the motif or fragment of interest, Query.NTList tells the nucleotide numbers, and because some PDB files contain multiple chains with overlapping nucleotide numbers, Query.ChainList tells the chain for each nucleotide. Query.DiscCutoff sets the maximum geometric discrepancy between a candidate and the query motif. Geometric discrepancy is somewhat like RMSD (root mean square deviation) and can be interpreted to have units of Angstroms. A discrepancy of 0.5 is a moderate number. Larger cutoff discrepancies will find more candidates but might take a long time to run. Query.ExcludeOverlap is a binary variable (0 or 1) which tells FR3D to exclude candidates which have many nucleotides in common, favoring the one having lower discrepancy with the query motif; 1 makes this happen, 0 tells FR3D not to bother.
The geometric discrepancy was defined in the 2008 paper FR3D: Finding Local and Composite Recurrent Structural Motifs in RNA 3D Structures, by Michael Sarver, Craig L. Zirbel, Jesse Stombaugh, Ali Mokdad, and Neocles B. Leontis, which is available at this link. It is a way of measuring how geometrically similar two sets of RNA nucleotides are, provided that they have the same number of nucleotides. One gives two lists of RNA nucleotides, call them A1, A2, …, An, and B1, B2, …, Bn. Order is important: A1 is supposed to correspond to B1, A2 to B2, etc. It is important to allow for the possibility that A1 and B1 do not have the same base, so one needs to find a way to superimpose a G on a U, for example. The solution in the 2008 paper was to first calculate a geometric center of the heavy atoms for each of the n bases (see the figure below, taken from the 2008 paper) and find the rotation matrix which optimally rotates the centers of B1, B2, …, Bn onto the centers of A1, A2, …, An. The sum of the squares of the distances between corresponding centers is called the location error. After this superposition, the base in nucleotide A1 might not perfectly superimpose with the base in nucleotide B1, and so we calculate the angle (in radians) that would be needed to rotate one base to align with the other, using the standard orientations shown in the figure below. The sum of squares of the angles is called the orientation error. The discrepancy is then defined by D = √(L² + A²) / n, where L² is the location error and A² is the orientation error. Dividing by n makes this a discrepancy per nucleotide, which has roughly the same meaning over a range of motif sizes.
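The calculation described above can be sketched in a few lines of Matlab. This is only an illustration under stated assumptions, not FR3D's own code: the variable names (cA, cB, oA, oB), the made-up coordinates, and the use of an SVD to find the best rotation are all choices made for this sketch, and the final formula is taken to be the square root of the location error plus the orientation error, divided by n.

```matlab
% Sketch of a geometric discrepancy calculation (not FR3D's own code).
% Assumed inputs: cA, cB are n-by-3 arrays of base centers, and oA, oB are
% 3-by-3-by-n arrays holding one rotation matrix (orientation) per base.
n  = 4;
cA = [0 0 0; 5 0 0; 0 5 0; 0 0 5];     % hypothetical centers for A1..A4
oA = repmat(eye(3), [1 1 n]);          % hypothetical orientations for A1..A4

% Build B as a rotated and shifted copy of A, then turn base B1 by 0.2 rad
% about its own z axis so that the two motifs are not identical.
Rz = [0 -1 0; 1 0 0; 0 0 1];           % 90 degree rotation about z
cB = cA * Rz' + repmat([10 20 30], n, 1);
oB = zeros(3, 3, n);
for i = 1:n
  oB(:,:,i) = Rz * oA(:,:,i);
end
t = 0.2;
oB(:,:,1) = oB(:,:,1) * [cos(t) -sin(t) 0; sin(t) cos(t) 0; 0 0 1];

% Step 1: superimpose the centers of B onto the centers of A.
X = cB - repmat(mean(cB, 1), n, 1);    % centered B
Y = cA - repmat(mean(cA, 1), n, 1);    % centered A
[U, ~, V] = svd(X' * Y);               % SVD-based optimal rotation
d = sign(det(V * U'));                 % guard against a reflection
R = V * diag([1 1 d]) * U';            % R * x_i is as close as possible to y_i
L2 = sum(sum((Y - X * R').^2));        % location error (sum of squares)

% Step 2: angle needed to align each rotated B base with its A partner.
A2 = 0;
for i = 1:n
  M = oA(:,:,i)' * R * oB(:,:,i);      % relative rotation between the bases
  c = min(1, max(-1, (trace(M) - 1) / 2));   % clamp against round-off
  A2 = A2 + acos(c)^2;                 % orientation error (sum of squares)
end

% Step 3: combine into a per-nucleotide discrepancy.
D = sqrt(L2 + A2) / n;
fprintf('L2 = %.4f, A2 = %.4f, discrepancy = %.4f\n', L2, A2, D);
```

In this toy example the centers superimpose exactly, so the location error is essentially zero and the whole discrepancy comes from the single base that was turned by 0.2 radians.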
An important characteristic of a FR3D search is that it is guaranteed to find all candidates whose geometric discrepancy with the query motif is less than or equal to the cutoff discrepancy, no matter what the chain connectivity is between the nucleotides in the candidate. In other words, a FR3D geometric search does not make any assumptions about the chain continuity of the nucleotides in the candidates, unless the user gives such constraints. In principle, if the query motif contains n nucleotides and we are searching in a structure having m nucleotides, there are roughly m^n possible sets of n-nucleotide candidates, and none of them are excluded a priori. In practice, nucleotides that are more than 30 Angstroms apart cannot be in the same candidate, but that still leaves a large number of potential candidates to search through. The actual search procedure is described in the 2008 paper and is based on the fact that if A1, A2, …, An, and B1, B2, …, Bn are sets of nucleotides and Bi and Bj are much farther apart than Ai and Aj are, then the geometric discrepancy between A1, A2, …, An, and B1, B2, …, Bn will be high, and we can be sure that B1, B2, …, Bn is not a good match to A1, A2, …, An, just by looking at those two nucleotides. Using pairwise distance checks, we can eliminate the vast majority of possible candidates and then calculate the full geometric discrepancy for those that are left, again rejecting those that are above the discrepancy cutoff.
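To get a feeling for how pairwise distance checks cut down the number of candidates, here is a small Matlab sketch. It is not the procedure from the 2008 paper: the coordinates are made up, the tolerance tol is a hypothetical stand-in for the bound that FR3D derives from the discrepancy cutoff, and candidates are simply grown one position at a time, keeping only those whose distances agree with the query.

```matlab
% Sketch of pairwise distance screening (illustrative, not FR3D's code).
Q = [0 0 0; 4 0 0; 0 3 0];                       % hypothetical query centers
S = [0 0 0; 4 0 0; 0 3 0; 40 0 0; 40 4 0; 43 0 0];  % hypothetical structure
tol = 0.5;            % assumed tolerance in Angstroms, not FR3D's bound
n = size(Q, 1);
m = size(S, 1);

% Pairwise distances within the query and within the structure.
dQ = zeros(n);
for i = 1:n
  for j = 1:n
    dQ(i,j) = norm(Q(i,:) - Q(j,:));
  end
end
dS = zeros(m);
for p = 1:m
  for q = 1:m
    dS(p,q) = norm(S(p,:) - S(q,:));
  end
end

% Grow candidates position by position; drop a partial candidate as soon
% as one of its distances disagrees with the query by more than tol.
cands = (1:m)';                       % any nucleotide can fill position 1
for k = 2:n
  newc = zeros(0, k);
  for c = 1:size(cands, 1)
    for p = 1:m
      if any(cands(c,:) == p)
        continue                      % a nucleotide is used at most once
      end
      if all(abs(dS(cands(c,:), p) - dQ(1:k-1, k)) <= tol)
        newc(end+1, :) = [cands(c,:) p];
      end
    end
  end
  cands = newc;
end

total = prod(m:-1:m-n+1);             % ordered choices of n distinct nucleotides
fprintf('%d of %d ordered candidates pass the distance screen\n', ...
        size(cands, 1), total);
disp(cands)
```

Only the candidates that survive this screen would need the full discrepancy calculation. Note that no chain order was assumed anywhere in the screen.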
Imposing pairwise symbolic constraints only serves to reject more candidates, and so speeds up the search. Generally speaking, any symbolic constraint that you can use which does not fundamentally constrain the search is a good one to use. Thus, for example, using the directionality constraints in the GNRA search above does not rule out any GNRA candidates, but does speed up the search quite a bit. On the other hand, if you remove those constraints, you find two new hairpins which are not unlike GNRA hairpins and which have a different chain connectivity order.
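As a rough illustration of what the directionality constraints do, the following Matlab sketch filters a list of candidates by requiring increasing nucleotide numbers, which is what the five Query.Diff '>' lines in the GNRA search ask for. Apart from the first row, the nucleotide numbers are hypothetical, and FR3D applies such constraints during the search rather than afterwards as is done here.

```matlab
% Sketch: apply "each nucleotide comes after the previous one" to a list
% of candidates. Rows 2 to 4 are made-up nucleotide numbers.
cands = [ 804  805  806  807  808  809;    % the query instance itself
         2630 2631 2632 2633 2634 2635;    % hypothetical, same chain order
         1500 1499 1498 1497 1496 1495;    % hypothetical, reversed order
          300  301  450  451  452  453];   % hypothetical, increasing with a gap

inOrder = all(diff(cands, 1, 2) > 0, 2);   % true where every step goes up
kept    = cands(inOrder, :);
disp(kept)
```

The '>' constraint only asks for a higher nucleotide number, not for adjacency, so the last row passes even though it has a gap; the reversed row is the kind of candidate that only shows up when the constraints are removed.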
Today we are going to use a FR3D search to find 3-way junctions, primarily in ribosomes, diagram their interactions, explore their conservation across bacteria, archaea, and eukaryotes, and look at the relative positions of their outgoing helices.
We begin by writing a FR3D search to find 3-way junctions with more than two nucleotides on each single strand. We are looking for six nucleotides that form the last Watson-Crick basepairs before the junction, and that also border the single stranded regions between the helices. The search should start like this:
clear Query % remove any previous Query parameters
Query.Name = 'Six flanking nucleotides of a 3-way junction';
Query.Edges{1,2} = 'cWW';
Query.Edges{2,3} = 'BorderSS';
Query.Diff{3,2} = '>';
Query.SearchFiles = {'Ribosome_list'};
oFR3DSearch
return
Your search should allow for all three symmetries of a 3-way junction, so that each instance is found 3 times. This will allow us to find junctions with similar geometries even though they occur with different strand orders.
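To see what the three symmetric listings look like, and how they can later be collapsed to one per junction, here is a small Matlab sketch. The nucleotide numbers are hypothetical, and the sketch assumes that positions 1-2, 3-4, and 5-6 are the flanking pairs, with each single strand running from position 2 to 3, 4 to 5, and 6 back to 1.

```matlab
% Sketch: the three listings of one 3-way junction, and grouping them.
hit   = [53 10 13 30 33 50];            % hypothetical candidate
other = [153 110 113 130 133 150];      % a second hypothetical junction

% Shifting the list by two positions gives an equally valid listing.
hits = [hit;
        other;
        circshift(hit,   [0 -2]);
        circshift(hit,   [0 -4]);
        circshift(other, [0 -2]);
        circshift(other, [0 -4])];

% Two rows describe the same junction when they use the same nucleotides,
% so sort within each row and keep the first row with each signature.
keys = sort(hits, 2);
[~, firstRow] = unique(keys, 'rows', 'first');
junctions = hits(sort(firstRow), :);

fprintf('%d listings describe %d distinct junctions\n', ...
        size(hits, 1), size(junctions, 1));
disp(junctions)
```

Keeping all three listings during the search is what lets junctions with different strand orders be compared geometrically; collapsing them is only useful afterwards, when counting distinct junctions.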
Before running the search, please add the new yeast mitochondrial ribosomal large subunit 1VW3 to the file Ribosome_list in the folder FR3D/PDBFiles. We should be searching this structure whenever we search ribosomal structures. Also, you can download the Matlab binary file with all of the FR3D analysis of 1VW3 at this link (or update the Github FR3D repository on the dev branch). Save 1VW3.mat in FR3D/PrecomputedData. The circular diagram can be found at this link.
When I ran the FR3D search, it found 315 candidates. Please choose Sort by centrality and then, when it is done, Sort by centrality again. This will calculate geometric discrepancies between all 315 candidates. Then Quit navigation (to save the discrepancies) and oDisplay to start the display again. Now choose Order by Similarity. Figure 99 will show you a few clusters, the largest of which seems to be repeated 3 times. This is because each 3-way junction appears three times in the search results, but the three appearances do not superimpose on one another. Choose Order by Similarity again to select just one instance of each 3-way junction in a way that preserves the clusters.
Focus on the largest cluster by choosing Navigate with Figure 99 and then clicking on the upper right hand corner of the cluster. This will mark those candidates, then choose Display marked only. Below is the heat map of mutual discrepancies that I get when I do this search. Let’s focus on the candidates in the lower right corner, since they are most geometrically similar.
Here you see candidates from the following structures:
Amazingly, there is one candidate from each of the large ribosomal subunit classes that have been solved to near atomic resolution (the 1VW3 structure was solved by cryo electron microscopy).
Open the circular diagrams for each of these structures. These are found in the FR3D/CircularDiagrams folder from the large zip file that I provided at this link. Check to see if these 7 instances occur in the same place in the secondary structure, so that we can figure they are homologous (have a common ancestor, long, long ago).
For reference, here are the main ribosomal small subunit structures as of June 29, 2014:
Now start to diagram these 3-way junctions. Please arrange the nucleotides in clockwise order, starting with the longest strand with nucleotides listed vertically from bottom to top, on the left of the diagram. Start with a bacterium, then the mitochondrion, then the archaeon, then a eukaryote. If you draw the diagrams in the same way, it will make it easier to see similarities between the structures.
The next cluster with similar geometry has 3 instances from bacteria, right in the center of the heat map above. Choose one of those and diagram it in the same orientation as the ones you have just done. Also check to see where this 3-way junction is in the circular diagram.
Let’s continue with the large cluster of 7 instances of a 3-way junction from the previous class period. Ask these questions about the instances:
So far, we have been studying the largest cluster of 3-way junctions in a search that returned 315 instances. Let’s look at the other clusters and divide and conquer: each person choose one of the clusters and address the four questions above.
The previous search found 3-way junctions in which there are 3 flanking Watson-Crick basepairs and 3 single-stranded regions. The way the BorderSS relation is coded in FR3D, the single-stranded regions must have at least one nucleotide between the nucleotides making Watson-Crick basepairs. However, some junctions have one “empty” single-stranded region. Adjust the search above to find such junctions by using one constraint like this:
% Query.Edges{2,3} = ''; % no constraint on interaction between 2 and 3
Query.Diff{3,2} = '> =1'; % sequentially adjacent nucleotides
Note that you only need one constraint like this. There are 39 such 3-way junctions in the ribosome list. In many cases, you will find that bases 2 and 3 are stacked on each other, but not in all cases. Download the FR3D search results for the whole non-redundant list at this link.
Today we will start the class by looking at instances of internal loops that have the same sequence but different geometry. The instances can be viewed at this link. We will look at a few examples together, then I would like you to each take a block of instances and decide whether they all have substantially the same geometry or whether one of them differs in a significant way, especially when they make different basepairs.
The phrase “one sequence, one structure” is appealing for RNA because it would mean that once you know the sequence of an RNA, there is a unique secondary or 3D structure associated with it. We know that this isn’t strictly true, because we know about riboswitches that change their secondary and 3D structure when a ligand binds. On the other hand, with Watson-Crick complementary sequences, we pretty well expect that “one sequence, one structure” will hold. What about other small structural units like internal loops? If “one sequence, one structure” holds, then we ought to be able to predict the 3D structure of these units fairly accurately. In some cases, this assumption seems to be justified, but in others, one sequence can have a variety of 3D structures.
I have collected together many instances of internal loops from the RNA 3D Motif Atlas on this web page: http://rna.bgsu.edu/experiments/jsmol/IL_with_identical_sequences.htm They are organized into sets which have the same “interior” sequence (the bases between the flanking Watson-Crick pairs). Within each set, they are ordered so that sequences with the same flanking bases are near each other. Each line tells the sequence of the internal loop, the loop ID, the motif group that it comes from in the RNA 3D Motif Atlas, and the base and nucleotide number of one instance in the loop. Notice that in the first 870 instances, there are always at least two motif groups listed, which suggests that these instances may have different geometries. I have added annotations to the first several sets to describe the differences in geometry. Please read those and look at a few of the instances to learn how I’ve annotated them.
The assignment is for you to annotate other sets of internal loop instances in the same way. You can print out lines of text and write your annotations on paper or, better yet, view the page source, copy it into an editor, and add your annotations to the end of each line as I have done. Then I can paste your annotations into the source of the web page and make this a useful resource. Neocles Leontis is pretty excited about writing a paper about these examples. If you are interested in working on a paper like this, let me know.
Here are things to look for.
This workshop will entail short, informative lectures on basic principles of RNA structure, interactions, and structural motifs, interspersed with hands‐on visualization and analysis of RNA 3D structures, using free software and online databases. Participants will learn how to find high quality 3D structures for RNA molecules of interest using resources we have developed in partnership with the Nucleic Acid Database (NDB). We will provide tutorial guidance in the use of 3D viewers like Swiss PDB Viewer to view and analyze RNA and RNA‐protein 3D structures. Participants will be introduced to recurrent interactions in RNA 3D structures (basepairs, base stacking, base‐backbone, and RNA‐protein interactions) and how to obtain annotations of these interactions for any RNA structure in PDB/NDB. We will discuss the constraints that these interactions put on sequence variability, and how to use those constraints to design experiments to investigate hypothesized interactions or to design RNAs with desired 3D structures.
Enumerating Basepairs using Triangle Bases
Start at https://spdbv.vital-it.ch/disclaim.html On the next page, click the link for Microsoft Windows. This downloads a .zip file. Extract, then move the folder SPDBV_4.10_PC to a folder on your computer for programs. Inside the folder, double click spdbv.exe to run the program. Close the "thanks" window, then use the menu to open a PDB file.
Start at this link: ftp://ftp.vital-it.ch/tools/SPDBV/ . If you use OSX 10.11 (El Capitan) or an earlier version, click on SPDV_4.1.0_OSX.zip to download it. If you use OSX 10.12 (Sierra) or higher, click on SPDV_4.1.1_OSX.zip instead. Once the download is complete, double-click on the file to unzip it and then move the folder SPDBV_4.1.0_OSX or SPDBV_4.1.1_OSX to your Applications folder (where you put programs!). Inside the Applications folder, double click the Swiss-PdbViewer icon to run the program:
Users of OSX 10.12/10.13 might see the following error (rotolib.aa error) when they try to open Swiss-PdbViewer. Follow the instructions on this page to resolve the problem: https://kb.unca.edu/help/how-to-articles/swiss-pdb-viewer
PDB file to open to test SwissPDB viewer: We have colored the helices in the tRNA in PDB file 1EHZ, available at this link.
Sample tRNA PDB file to download at this link.