Statistical Mechanics – NVE and NVT ensembles

Introduction

In statistical mechanics, we were taught tend to think of the system in at least two ways: NVE – the microcanonical ensemble, and NVT – the canonical ensemble. In the NVE ensemble, the Number of atoms, the Volume of the space, and the total Energy (Hamiltonian) of the system are kept constant. In the NVT ensemble, the Number of atoms, the Volume of the space, and the average kinetic energy (Temperature) are kept constant. An MD simulation can be conducted in either ensemble, resulting in a trajectory of points.

Methodology

Since we wrote up a little MD engine a while back, why not put it to use and showcase some of the key differences? For an NVE simulation, the previous code does exactly what it needs to – the symplectic Velocity Verlet integrator conserves the total energy. However, for an NVT simulation, we use a heat-bath (otherwise known as a thermostat) to impart random motion on to the system in order to conserve average kinetic energy. In this case, we implement the Andersen thermostat by modifying the code with an if statement:

void velocity_verlet(FLOAT_TYPE x0,  FLOAT_TYPE y0,
                     FLOAT_TYPE vx0, FLOAT_TYPE vy0) {

        // ... same as before 

        x += vx*delt + delt*delt*fx/2;
        y += vy*delt + delt*delt*fy/2;

        fx_delt = -1*p_deriv_CD(x,y,0);
        fy_delt = -1*p_deriv_CD(x,y,1);

        if( collide() == false ) { //<--- collision detection

               vx += delt*(fx + fx_delt)/2;
               vy += delt*(fy + fy_delt)/2;

        }
        else {
               vx = Normal(sigma);
               vy = Normal(sigma);
        }

        // ... same as before
}

Here we set the components of the particle velocity vectors by sampling from the Normal distribution. There are different ways to implement collide(), one way is to simply use a random number generator between 0 and 1 and test the value against a threshold frequency.

Trajectories

One of the key observations we immediately make from this simulation is that in an NVE ensemble, more points are higher in the high potential energy regions. However, in the NVT ensemble, more points are found in low potential energy regions (minimas). Incredibly cool.


V(x,y) = \cos(x) + \cos(y)


V(x,y) = \cos^2(y) + 1/(\cos(x)+2)


Special thanks to Fred for helping generate some of these figures!

Protected: Discrete Divergence Theorem

This post is password protected. To view it please enter your password below:


Protected: Data Structures in Parallel Computing: An Open Problem

This post is password protected. To view it please enter your password below:


Re-release of the RMSD kernel

I’ve decided to make a much neater interface to the RMSD kernel and re-release it. It does one simple thing: calculate the “all against one” rmsd. In other words, given a set S of N=|S| conformers, and a particular conformer x \in S, it evaluates rmsd(x,y), \forall y \in S

So far, this is supported on 64-bit Linux OSes and Nvidia cards with compute capability > 2.0. This assumes you have the nvidia libraries This no longer needs the nvidia libraries or the nvcc compiler if linking against the .so (shared library).

Here’s a sample input file: rand_conf.dat.

The first row indicates the number of atoms in the system followed by the number of conformers. The remaining rows represent:

Conf1_Atom1_X Conf1_Atom1_Y Conf1_Atom1_Z
Conf1_Atom2_X Conf1_Atom2_Y Conf1_Atom2_Z
Conf1_Atom3_X Conf1_Atom3_Y Conf1_Atom3_Z
...etc
Conf10_Atom3_X Conf10_Atom3_Y Conf10_Atom3_Z
Conf10_Atom4_X Conf10_Atom4_Y Conf10_Atom4_Z
Conf10_Atom5_X Conf10_Atom5_Y Conf10_Atom5_Z

Here is some sample code showing how to use it.

//file: main_test.cpp
#include <iostream>
#include "dataIO.h"
#include "RMSD.cuh"
int main() {
    uint numConfs;
    uint numAtoms;
    // Read this input file into ADP format!
    float *h_X  = read_fileADP("rand_conf.dat", numConfs, numAtoms); 

    // Construct an RMSD object
    Rmsd r(numAtoms, numConfs, h_X);

    // Calculate RMSD
    vector<float> * rmsds_once = r.all_against_one(2);
    vector<float>::iterator it;
    r.print_params();
    for(it = rmsds_once->begin(); it!= rmsds_once->end(); it++) {
        std::cout << *it << "\n";
    }   

    delete rmsds_once;
    // yes this is quirky - but we want to use arrays for
    // this because we want to hint to the user that it needs
    // to be carefully managed!
    free(h_X);
}

To compile the code with your nvcc or g++ compiler:

g++ dataIO.cc
nvcc main_test.cpp RMSD.o dataIO.o
//or
g++ main_test.cpp RMSD.so dataIO.o

You don’t have to use the read_fileADP methods. You can supply the Rmsd constructor with any pointer to a chunk of memory of coordinates stored in ADP format. h_X, the 1-d storage array for all the coordinates, must be stored in memory in ADP format. This “strange” format is very important to support fast coalesced reads and to minimize cache-misses.

To read/write the coordinate of Ath atom, in the Dth (d=x,y,z=0,1,2) dimension, of the Pth conformer:

h_X[A * numConfs * numDims + D * numConfs + P] // where numDims = 3

And voila! you can hook this to whatever you want. Note this implementation uses a mixed floating/double optimization of Theobald’s QCP method. The squared RMSD is accurate up to ~0.002 units of precision. If you really need a more accurate method, or the source code, please let me know.

Edit: Forgot the Download Link.

A simple MD tutorial

Modern molecular dynamics tends to be so vastly complicated that we often get lost in the myriad of details. The goal of this tutorial is to show the readers how to build a generalized MD simulation from scratch, and in the process, distill the fundamentals behind MD. Note that we don’t describe the actual physical system, as we want to be generic. Code snippets are written for C++.


Step 1. Given a physical system, describe your coordinate space and prescribe a potential V on this coordinate space.

For purposes of illustration. Let the domain of my potential be S = \{ [a,b] \mid a,b \in [0,2\pi] \}. Thus, S is the flat 2-torus. Define V: S \rightarrow \mathbb{R}, where explicitly,

(1)   \begin{equation*} V(x,y) = cos(x) + cos(y) \end{equation*}

The potential code is very simple:

double potential(double x, double y) {
    //replace later with any potential of our choosing
    double V = cos(x) + cos(y);
    return V;
}

Step 2. Find the partial derivatives of V.

Analytically, the solutions are:

(2)   \begin{eqnarray*} \frac{\partial V}{\partial x} = -sin(x) \\ \frac{\partial V}{\partial y} = -sin(y) \end{eqnarray*}

In practice, it is usually impractical to derive this analytically. It is usually calculated numerically using a central difference scheme:

double p_deriv_CD(double x, double y, int dim) {
    double delt = 1e-6; // difference step
    double val = 0;
    if(dim == 0) {
        double v_prev = potential(x-delt,y);
        double v_forw = potential(x+delt,y);
        return (v_forw - v_prev)/(2*delt);
    }
    else {
        double v_prev = potential(x,y-delt);
        double v_forw = potential(x,y+delt);
        return (v_forw - v_prev)/(2*delt);
    }
}

Step 3. Set up your ordinary differential equation (ODE) to create the equations of motion.

This is the heart of molecular dynamics. The form of your differential equation depends upon your coordinate system and choice of ensemble. However, only in exceptional circumstances can the ODE be solved analytically. Realistically, molecules have constraints that fixate bond lengths of angles. There are two ways to deal with constraints.

The popular approach is that for each time step, assume constraints do not exist and therefore simply integrate the equations using Newton’s equations of motion. Afterwards, we get some new position x'_{t+1}. Then we do an update step using a constraint algorithm to correct x'_{t+1} to the proper position x_{t+1}. The other approach, and often hailed as esoteric and overly complicated, is to use Lagrangian mechanics, which in brief, integrates number of equations of motions and do not need an update, but each equation in the system of ODEs is extremely complicated – because the form of the kinetic energy is no longer as trivial as \frac{1}{2}mv^2.

In our example, we assume that our equations of motion are constraint-free. In particular, we use the NVE ensemble (constant number of atoms, volume, and energy), note that most simulations nowadays utilizes the NVT ensemble (constant number of atoms, volume, and temperature). We use the following set of second order non-linear differential equations, assuming masses of unity:

(3)   \begin{eqnarray*} \frac{\partial V}{\partial x} = -sin(x) = -\frac{d^2x}{dt^2} \\ \frac{\partial V}{\partial y} = -sin(y) = -\frac{d^2y}{dt^2} \\ \end{eqnarray*}


Step 4. Integrate the equations of motion.

In general, the ODEs tend to be a system of ordinary differential equations that must be solved numerically. We typically call these solvers integrators. If you have taken a course on numerical computing, you’ll probably have learnt the Euler method, the midpoint method, Heun’s method, Runge-Kutta, etc. However, not all integrators are suitable as they must conserve energy (and preserve “area”), and these collectively, are called symplectic integrators. The most commonly used sympletic integrators are Basic Verlet, Velocity Verlet, and Leapfrog.

This tutorial implements velocity verlet, a second-order integrator. The input into velocity verlet is a initial position vector and a initial velocity vector. During each time-step, that is, to integrate the equations of motion, we first calculate the current force. Then we calculate the position at time t+1 using the current position, current velocity, and current force. Now we have the position at time t+1. Then we calculate the new forces at this new position, giving us the force at time t+1. Using the current force and force at time t+1, we can calculate the velocity at the next time-step. In detail,

As a matter of notational convenience:

(4)   \begin{eqnarray*} f_x(x,y) \equiv \frac{-\partial V(x,y)}{\partial x} \\ f_y(x,y) \equiv \frac{-\partial V(x,y)}{\partial x} \end{eqnarray*}

The velocity verlet scheme then proceeds as follows:

(5)   \begin{eqnarray*} x(t+dt) &=& x(t) + vx(t)dt + \frac{f_x(x(t),y(t))}{2} dt^2 \\ y(t+dt) &=& y(t) + vy(t)dt + \frac{f_y(x(t),y(t))}{2} dt^2 \\ vx(t+dt) &=& vx(t) + \frac{f_x(x(t),y(t))+f_x(x(t+dt),y(t+dt))}{2} dt \\ vy(t+dt) &=& vy(t) + \frac{f_y(x(t),y(t))+f_y(x(t+dt),y(t+dt))}{2} dt \end{eqnarray*}

The code to implement the whole thing is as follows:

void velocity_verlet(FLOAT_TYPE x0,  FLOAT_TYPE y0,
                     FLOAT_TYPE vx0, FLOAT_TYPE vy0) {

    double delt = 0.0005;
    int MAX_STEPS = 500000;

    double x = x0;
    double y = y0;
    double vx = vx0;
    double vy = vy0;

    // xx - next step (t+dt)
    // x - current step (t)
    // xv - current velocity (t)

    for(double t = 0; t<delt*MAX_STEPS; t+=delt) {
        if(x > 2*PI)
            x -= 2*PI;
        if(x < 0)
            x += 2*PI;
        if(y > 2*PI)
            y -= 2*PI;
        if(y < 0)
            y += 2*PI;

        // calculate potential energy
        float e_tot = 0.5*vx*vx + 0.5*vy*vy + potential(x,y);

        fx = -1*p_deriv_CD(x,y,0);
        fy = -1*p_deriv_CD(x,y,1);

        x += vx*delt + delt*delt*fx/2;
        y += vy*delt + delt*delt*fy/2;

        fx_delt = -1*p_deriv_CD(x,y,0);
        fy_delt = -1*p_deriv_CD(x,y,1);

        vx += delt*(fx + fx_delt)/2;
        vy += delt*(fy + fy_delt)/2;
    }
}

Step 5. Plot your data and check your trajectories.

To make sure that you do not go insane while writing MD code – it is important that we check for at least two things: 1, the trajectories has to look reasonable. 2, the energy must be conserved.

The following figure shows the trajectory with symmetrical starting points:

(x_0,y_0) = (2,2) , (vx_0,vy_0)=(0,0) , dt = 0.0005 , steps = 50000

Well, this is not very interesting. Now suppose instead, we started with a non-symmetric initial position:

(x_0,y_0) = (2,1.5) , (vx_0,vy_0)=(0,0) , dt = 0.0005 , steps = 50000

Now things are getting interesting. The first behavior we immediately notice is that there is no nice oscillation pattern (as was the case in the symmetric case). Let’s increase the length of the simulation:

(x_0,y_0) = (2,1.5) , (vx_0,vy_0)=(0,0) , dt = 0.0005 , steps = 500000

Figure 1.

Lo and be hold, the second behavior we notice is that the set of positions seem to be bounded by the rectangle [2,2\pi-2] \times [1.5, 2\pi-1.5]. Imagine plotting out the level curve (an implicit function):

(6)   \begin{equation*} cos(x)+cos(y)=cos(x_0)+cos(y_0) \end{equation*}


All points with this region can visited and not violate the conservation of energy. Yet, our system only chooses to visit a subset of the possible states. But this makes sense for the following reason. To calculate x(t+dt), we need information of x(t), xv(t), and f_x(x(t),y(t)). The xv(t) term depends on f_x(x(t-dt),y(t-dt)), and f_x(x(t),y(t)). However, f_x in general depends only x, and not on y. Thus, the calculation of x(t+dt) is also independent of y. In other words, the trajectory along the x-dimension is independent of the trajectory along the y-dimension because the partial derivatives are independent.

A third behavior that becomes immediately clear is that the system spends most of its time on the outer ring. The density of points in the minima region is in fact, quite small. Again, by conservation of total energy, this makes sense because at regions of high potential energy, the kinetic energy is low; and at regions of low potential energy, the kinetic energy is high. A higher kinetic energy implies a higher velocity at that point, which implies that the system is moving very fast through that point.

And there we go! Your own little MD engine in less than 300 lines of code.


Addendum

A holonomic constraint can be written in the closed form:

(7)   \begin{equation*} f(q_1, q_2, q_3,\ldots, q_{n}, t) = 0 \end{equation*}

For example, bond length constraints and bond angle constraints:

(8)   \begin{eqnarray*} |\vec{p} - \vec{q}| + k &=& 0 \\ \frac{\vec{p} \cdot \vec{q}}{|\vec{p}||\vec{q}|} + cos(\phi) &=& 0 \end{eqnarray*}

Some other cool plots:

V(x,y) = cos(x)cos(y)+cos(2x)cos(2y)

(x_0,y_0) = (0.1,0.2) , (vx_0,vy_0)=(0,0) , dt = 0.0001 , steps = 1000000

Top down and side view


V(x,y) = cos(y)^2+1/(cos(x)+2)

(x_0,y_0) = (3.14,2.9) , (vx_0,vy_0)=(0,0) , dt = 0.001 , steps = 1000000


Convergent Potentials:

Figure 2.


In the future:

-Parallelizing Velocity Verlet
-Derivation of Lagrangian Mechanics
-More Code!

On the notion of distance in proteins – Part 1

This is the introductory post on a difficult, yet fundamental, issue in protein folding, structure, and chemistry in general.

One thing that has always bothered many scientists to great lengths [no pun intended] is the notion of the metric used in calculations of “distances” between two conformers. In clustering, alignment, etc. often our goal is to minimize this “distance” and we try to provide some rationality to its interpretation. If the metric function between two conformers x and y is small, we often tend to say things like:

-conformer x is a lot like conformer
-conformer x can get to conformer y easily.
etc.

Recall that we can endow a manifold M with a metric d: M \rightarrow R^+ of our choosing, so long as it satisfies the following properties:

\forall x,y,z \in M
1. d(x,y) \geq 0 (non-negativity)
2. d(x,y) = 0 if and only if x = y (identity of indiscernibles)
3. d(x,y) = d(y,x) (symmetry)
4. d(x,z) \leq d(x,y) + d(y,z) (triangle inequality)

In the case of protein structure, one way is to look at the structure (say with N carbon-\alpha atoms), and define the metric as follows:

Let M be the euclidean R^{3,N} space, x,y \in M, x=(x_1,x_2,..,x_N), y=(y_1,y_2,..,y_N), endowed with the standard Euclidean metric, the l^2 norm, d, defined as follows:

(1)   \begin{equation*} d(x,y) = \sqrt{\sum_{i=1}^N (x_i-y_i)^2}$ \end{equation*}

The most commonly used metric in measure similarity of two protein conformers is the root mean squared deviation, herein known as simply rmsd. It is just the Euclidean metric modified by a \sqrt{1/N} term. The reader can verify that rmsd satisfies the previous four properties of metric space.

(2)   \begin{equation*} rmsd(x,y) = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i-y_i)^2}$ \end{equation*}

One immediate problem arises with this definition.

Consider a conformer x \in R^{3,N}, we subject it to a linear translation R and a rotation E to produce a vector \hat{x} \in R^{3,N}, with R,E, being the 3×3 translation and the Euler rotation matrix. ie., R + E x = \hat{x}

Now clearly, \hat{x} is the same structure as x, we haven’t changed it at all! We just rotated and translated x a few times to get \hat{x}! Yet if we again, evaluate rmsd(x,\hat{x}), we are most likely to get a completely different rmsd value. Hence, for this reason, the structural biologists came up with rmsd_{opt}, defined more or less unrigorously, as follows:

let Y \subset R^{3,N} be the set of all possible translation and rotations of a y \in R^{3,N}

(3)   \begin{equation*} rmsd_{opt}(x,y) = \min_{y \in Y}\sqrt{\frac{1}{N} \sum_{i=1}^N (x_i-y_i)^2}$ \end{equation*}

Essentially, it considers all possible rotations and translations of y, and calculates the rmsd against a/the y_x that best superimposes onto x (question to readers – prove or disprove that the optimally superimposed vector, y_x, is unique).

Finding y_x is a non-trivial problem and it is difficult to discuss on a superficial level. The method I use in my code is Doug Theobald’s QCP method.

An astute reader may have already noticed that rmsd_{opt} does not quite play well with property #2 (identity of indiscernibles), as if we go back to our discussion of translation and rotations, you’ll see that rmsd_{opt}(x, \hat{x})=0, even though x \neq \hat{x} as they clearly have different coordinates.

Hence a highly recommended paper is Boris Steipe’s proof showing that rmsd_{opt} also satisfies the four properties of metric spaces, if we choose to define our metric using equivalence classes (brief: all vectors within an equivalence class can be transformed from one to another using a rotation and a translation matrix. So we do a comparison between equivalence classes in order to satisfy #2).

Part 2 will talk about energetics and fundamentally deeper problems that arise even when using rmsd_{opt}. Stay tuned!

Genetic screens and selections

In previous posts I’ve talked about the two fundamental steps involved in directed evolution: 1) generating diversity in a protein through mutagenesis, and 2) finding the variants in this library that have improvements in a property of interest. A challenging question, then, is out of the many thousands of mutated proteins that we create, how do we find the one that does that we want? The answer is through either a genetic screen or selection.

A genetic screen involves individually picking colonies (either manually or with a robot), growing them in conditions that allow the protein of interest to express, lysing the cells, and then assaying the lysate for a particular activity. Protein activity assays typically involve incubating the protein lysate with a substrate of interest, derivatizing the expected product with a colorimetrically active chemical, and then looking for a colorimetric shift using a spectrophotometer. For example, compounds like Brady’s reagent (2,4-Dinitrophenylhydrazine) can be used to detect formation of carbonyl groups, which can be formed by dehydratases or dehydrogenases.

Genetic screens tend to be limited to library sizes of ~10^4 . This is primarily due to the human factor in screening: even with robots, one has to personally supervise the colony picking, the liquid transfers when growing and lysing cells and conducting the protein assays. Often screens are conducted using 96-well plates, and equipment speed and capacity becomes an issue when you want to screen upwards of 4000 protein variants at a time.

Genetic selections, on the other hand, can easily assay 10^8 to 10^10 variants of a protein of interest. The trick here is to find conditions that only allow cells containing proteins with the property of interest to survive. Then, it is easy to just plate as many cells as possible, and find the few colonies that can grow. The limiting factor becomes the amount of cells you can successfully transform with the plasmid that expresses your protein of interest. In yeast, the upper bound is ~10^8, while in E. coli it is ~10^10.

Genetic screens can be effective, but only if there is a way to correlate cell survivability with a particular protein function. If you’re evolving a protein involved in an essential metabolic pathway or in antibacterial resistance, this can be easy, but often we’re more interested in biofuels or fine chemical synthesis, which can actually oppose cell survivability.

Deciding on whether to use a genetic screen or selection, and the precise method for each, is probably the most important part of beginning a directed evolution project. In fact, a commonly heard mantra in the field is “You get what you screen for”. If your assay is set up the wrong way, you may get completely useless proteins.

GPU Rmsd

Released the “alpha” version of the rmsd library. Please check the “Code” section at the top of the page.

Brief Update

I am preparing to release the k-centers and rmsd-kernel source libraries (soon™).
In academia everything tends to come with a paper, which is delaying the release a bit.

If you would like early-access – please send me an e-mail: proteneer [AT] gmail

A blast from the past

I worked on the Fold It! project in the summer of 2008, and it was one of the most rewarding experiences of my life. For those who do not know what FoldIt! is: it is a “game” based off of tools from the Rosetta@Home engine that enables even a high school student to fold a protein using a set of intuitive tools.

Although at first glance it seems like a game, but one of its major goals is actually to use power human spatial reasoning to see “what fits where” in protein folding. Every other year, the world’s top structural biologists, crystallographers, and computational biochemists gather for the annual Critical Assessment of Structure Prediction competition. During the competition, each group does their best to make “blind” predictions given only an amino acid sequence and currently public available information, such as the Protein Data Bank. Afterwards, the official results from the crystallographers are released – and a critical assessment is done for each prediction method. In the Baker lab, some of the submitted predictions are the top scoring results from Fold It! players.

Recently, the group published an article in Nature Structural Biology called “Crystal structure of a monomeric retroviral protease solved by protein folding game players.” Where a fold it group called The Void Crushers came up with some great results.

(An aside: void denotes the region of space where there exists nothing, not even water, and theoretically, voids really shouldn’t exist in the protein core. In Fold It! we do our best to remove these voids as they incur a fairly heavy penalty to the energy score.)


(source: Nature/Baker Lab)

In the above diagram, the Yellow strand is the result from FoldIt! players. The Red strand is the initial rosetta template that the players were given to work with. The Blue strand is the “native” structure as determined by crystallography. The superposition clearly indicates that blue strand is much much closer to the native structure than the Server generated red structure.


(source: Nature/Baker Lab)

The next figure highlights the collaborative nature of the game. Three gamers/users: spvincent, grabhorn, and mimi, collaborated amongst themselves (and many other players) to improve the structure of one protein. Their results generated the biggest “jumps” in energy stabilization. Each player made subsequent improvements that lowered the energy of the system, either at different parts or times.

Who said gaming can’t be used to empower science?

Additional links:

FoldIt!
Nature Article
(free)
Baker Lab