Installing SSL certificates for the Tornado webserver

There are plenty of guides out there on how to install SSL certs for Apache and the like, but I had a fun weekend struggling with Tornado. What the docs don’t tell you is that the ssl_options dict passed to the TCPServer constructor also accepts a ‘ca_certs’ key pointing to your CA’s root/intermediate chain. Most browsers won’t need it, but other Python libraries (like requests) and curl do, so you need to set it up explicitly. Otherwise you will be in a world of pain trying to debug SSL!

Here’s the code:

tornado.httpserver.HTTPServer(instance, ssl_options={'certfile': cert_path, 'keyfile': key_path, 'ca_certs': ca_path})
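
For completeness, here’s a minimal self-contained sketch of the full setup. The handler and the certificate paths are placeholders; substitute your own cert, key, and CA bundle.

# Minimal HTTPS setup sketch; the file paths below are placeholders.
import tornado.httpserver
import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("hello over https")

application = tornado.web.Application([(r"/", MainHandler)])

http_server = tornado.httpserver.HTTPServer(application, ssl_options={
    'certfile': '/etc/ssl/example.com.crt',        # server certificate
    'keyfile': '/etc/ssl/example.com.key',         # private key
    'ca_certs': '/etc/ssl/example.com.ca-bundle',  # CA root/intermediate chain
})
http_server.listen(443)
tornado.ioloop.IOLoop.instance().start()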

Wrapping C++ classes with Cython, Python 3.3, Windows

Well this was a fun weekend. I tried out several libraries that could wrap C++ classes.

Boost::Python – not header-only, and really hard to strip out with bcp while keeping a working build system.
ctypes – does not support C++ classes all that well, but it is in the standard library.
Cython – not in the standard library.
SWIG – libraries are bad enough as a dependency; a whole executable is worse (in theory). In practice, we’ve used it quite extensively with OpenMM.

I wanted to try out Cython, because it occupies the very interesting position of being a hybrid C++/Python language to start with, so a lot of features are fully baked in. The setup script below gets things going.

# setup.py 
# notes: cythonize examples are horrendously broken
# On Windows, you need to execute:
# set VS90COMNTOOLS=%VS100COMNTOOLS%
# python setup.py build_ext --compiler=msvc
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import glob

sources = glob.glob('../source/*.cpp')
sources.append('terran.pyx')

ext_modules = [Extension('terran',
                        include_dirs = ['../include'],
                        language = 'c++',
                        sources = sources,
                        define_macros = [('TERRAN_BUILDING_SHARED_LIBRARY', '1')],
                        )]
                        
setup(
    name = 'TERRAN',
    cmdclass = {'build_ext': build_ext},
    ext_modules = ext_modules
)

I have two C++ classes I want to wrap, Cluster and ClusterTree. One of ClusterTree’s methods instantiates a Cluster object and returns a reference to it (ownership of the returned Cluster still belongs to the ClusterTree), but I’d also like Cluster to be constructible directly from Python. I tinkered around with @classmethods coupled with an empty __init__(), but decided that it resulted in horrendously ugly and non-intuitive code. Instead, I’ve come up with the following solution:

# terran.pyx
from libcpp cimport bool
from libcpp.vector cimport vector
        
cdef extern from "../include/Cluster.h" namespace "Terran":
    cdef cppclass Cluster:
        Cluster(vector[vector[double]],vector[int]) except +

cdef class PyCluster:
    # important: these are instance variables
    cdef Cluster* __thisptr
    cdef int __delete
    def __cinit__(self, *args):
        if len(args) == 2:
            data = <vector[vector[double]]?> args[0] # dynamic cast with safety check
            period = <vector[int]?> args[1]          # dynamic cast with safety check
            self.__thisptr = new Cluster(data, period)
            self.__delete = 1
        
    cdef __initFromRawPointer(self, Cluster* cptr):
        self.__thisptr = cptr
        self.__delete = 0

    def __dealloc__(self):
        if self.__delete:
            del self.__thisptr

cdef extern from "../include/ClusterTree.h" namespace "Terran":
    cdef cppclass ClusterTree:
        ClusterTree(vector[vector[double]],vector[int]) except +
        Cluster& getCurrentCluster()
        
cdef class PyClusterTree:
    
    cdef ClusterTree *__thisptr
    
    def __cinit__(self, vector[vector[double]] data, vector[int] period):
        self.__thisptr = new ClusterTree(data,period)
                
    def __dealloc__(self):
        del self.__thisptr
              
    def get_current_cluster(self):
        cdef PyCluster pyc = PyCluster()
        cdef Cluster* ptr = &(self.__thisptr.getCurrentCluster())
        pyc.__initFromRawPointer(ptr)
        return pyc

This is fairly elegant. A single initializer is exposed. Pointers are handled correctly with proper ownership semantics. The automatic mapping from lists of lists (or 2D numpy arrays) to vector[vector[double]] is absolutely beautiful. One thing to be really careful about is that the cdef’d __thisptr actually is an instance variable, and not a class attribute. This matters because Cython is statically compiled, so you can’t arbitrarily add new instance variables (or class attributes) at runtime.
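
For what it’s worth, usage from the Python side then looks something like this (the data values here are made up purely for illustration):

# Hypothetical usage of the wrapped classes; the values are made up.
import terran

data = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # coerced to vector[vector[double]]
period = [0, 1]                              # coerced to vector[int]

tree = terran.PyClusterTree(data, period)
cluster = tree.get_current_cluster()         # borrows a pointer owned by the tree

standalone = terran.PyCluster(data, period)  # owns (and deletes) its own Cluster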

Core 17 back to Beta!

We’ve updated Core 17 with OpenMM 5.1, so check out the release video for more info:

A live Q&A is available on reddit.

Some of the key highlights are:
-Up to 120,000 PPD on GTX Titan, and 110,000 PPD on HD 7970
-Support for more diverse simulations
-Linux support on NVIDIA cards and 64-bit OSes
-FAHBench updated to use the latest OpenMM and display version information

Full Transcript of the Talk:

Hi I’m Yutong, I’m a GPU core developer here at Folding@home. Today I want to give you guys an update on what we’ve been working on over the past few months. Let’s take a look at the three major components of GPU core development. First off, we have OpenMM, our open source library for MD simulations. It’s used by both FAHBench and Core17. FAHBench is our official benchmarking tool for GPUs, and it supports all OpenCL compatible devices. We’re very happy to tell you guys that it’s been recently added to Anandtech’s GPU test suite. And Core17 is what your Folding@home clients use to do science. By the way, all those arrows just mean that the entire development process is interconnected.

So let’s take a step back in time.

Last year in October, we conceived Core 17. We had three major goals in mind: we wanted a core that was going to be faster, more stable, and able to support more types of simulations than just implicit solvent. But because of how our old cores 15 and 16 were written, it was in fact easier for us to write the new core from scratch.

So in November, we started rewriting some of the key parts to replace pre-existing functionality. Two months later, in January, things started to come together. Our work server, assignment server, and client were modified to support Core 17. We also started an internal test team, for the first time ever, using an IRC channel on freenode to provide real-time testing feedback.

In February, Core 17 had a public beta of over 1000 GPUs, and we learnt a lot of valuable things. One of them was that the core didn’t seem all that much faster on NVIDIA, though on AMD things certainly looked brighter. Things still crashed occasionally, and bugs were certainly still present. So we went back to the drawing board to improve the core.

In April, we added a lot of new optimizations and bug fixes to OpenMM. We tested a Linux core on GPUs for the first time ever. And our internal testing team had grown to over 30 people. And that brings us to today.

We now support many more types of simulations, ranging from explicit solvent to large systems of up to 100,000 atoms. We improved the stability of our cores. We now have a sustainable code base. We added support for Linux for the first time. It’s also really fast – so I’m sure the burning question on your mind is: just how fast is it? Well, let’s take a look. On the GTX Titan we saw it go from 50,000 points per day to over 120,000 points per day. On the GTX 680, we saw it go from 30,000 points per day to over 80,000 points per day. On the AMD HD 7970, we saw it go from 10,000 points per day to over 110,000 points per day. On the AMD HD 7870, we saw it jump from 5,000 points per day to over 50,000 points per day.

We never want to rest on our laurels for too long. We are already planning support for more Intel devices in the future, such as the i7s, integrated graphics cards, and Xeon Phis. We plan to add more projects to Folding@home as time goes on, so researchers within our group can investigate more systems of interest. And as always, we want things to be faster.

Now let’s go back to the beginning again, and here’s how you guys can help us. If you’re a programmer, we invite you to contribute to the open source OpenMM project (available at the end of the month on github.com/simtk/openmm). If you’re an enthusiast who likes to build state-of-the-art computers, we encourage you to run FAHBench and join our internal testing team on freenode. If you’re a donor, we’d like you to help us spread the word about Folding@home and bring in more people, and their machines of course. Now before I wrap things up, there are some people I’d like to thank. Our internal testers are on the right hand side, and they’ve been instrumental in providing me with real-time feedback regarding our tests. We couldn’t have done it this fast without them. On the left hand side are people within the Pande Group: Joseph and Peter are also programmers like me, Diwakar and TJ helped set up many of our projects, and Christian and Robert have always been there for support and feedback.

But wait, one last thing. This week, I’ll be doing a Questions and Answers session on reddit at reddit.com/r/folding. So if you’ve got questions, come drop by and hang out with us. Thanks, and bye-bye.

FAHBench 1.0

We’ve released FAHBench 1.0, with a slick new GUI that should make it much more accessible to newcomers. Click on the FAHBench link above or the image below to try it out! Don’t worry, it maintains backwards compatibility with the old command line interface.

FAHBENCH

Folding@Home Core 17 – Live Beta!

We are happy to announce that Folding@Home Core 17 has entered Beta. Externally, you probably won’t notice too much of a difference. Internally, this is a complete overhaul that brings many new features, and sets a strong foundation for the future of GPU core development. In addition, the restructuring brings much tighter integration of the core with the rest of the development within Folding@Home.

We’re also introducing an explicit solvent project (p7661) as part of the Beta. To keep the credit assigned to these projects consistent with previous explicit solvent work units run on CPUs, we are also awarding a quick return bonus with a k-factor of 0.75. This reflects the additional scientific value of these units, and keeps the Folding@Home credit awards consistent across different architectures.

Usage:
This is still a very new core, and a lot of the features have yet to be fully tested. Thus, as is the beta policy, no official support is given. You must enable the -beta flag on FAHClient, i.e. set client-type=beta. If you’re using client 7.2.x or earlier, there are two options:
1. Specify -gpu-vendor=XXX, where XXX is either nvidia or ati.
2. If -gpu-vendor is not specified, the core will automatically guess the platformId.
Otherwise, client 7.3.6 lets you specify the particular -gpu-vendor as an option.
Supported NVIDIA cards: Fermi or better (Titan does not work atm, as NVIDIA needs to publish new OpenCL drivers)
Supported ATI cards: HD5000 or better
As always, please use the latest WHQL drivers (Win XP is NOT supported due to super-old AMD drivers).

Key Features:
Cleaner Code
We have deprecated several layers of GROMACS and other wrappers, as the old architecture severely limited the types of simulations that could be run. Much of the work on the new core has been to replace existing features. The resulting code is now more streamlined and integrated. We also anticipate that this major re-design will allow us to introduce new features into Folding@Home much faster.

Serialization
We have introduced a new serialization mechanism that allows Pande Group researchers to set up significantly more diverse simulations. Pande Group researchers can now easily set up jobs and projects using Python (with a much richer and easier-to-use set of libraries), while the core maintains its speed by virtue of being written in C++. We achieved this using a serialization technique, whereby all details of a simulation are encapsulated in a standardized format that can then be passed across the language barrier. This also drastically reduces the dependencies needed by the Work Server and other parts of Folding@Home.
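
As a rough sketch of the kind of workflow this enables, here is what serializing a system with OpenMM’s XML serializer looks like on the Python side. The file names and force field choices below are placeholders, not the actual Folding@Home project setup:

# Sketch: build a System in Python, serialize it to XML for a C++ core to load.
# Input files and force fields here are placeholders.
from simtk.openmm import XmlSerializer
from simtk.openmm.app import ForceField, PDBFile

pdb = PDBFile('protein.pdb')
forcefield = ForceField('amber99sb.xml', 'tip3p.xml')
system = forcefield.createSystem(pdb.topology)

with open('system.xml', 'w') as f:
    f.write(XmlSerializer.serialize(system))

# The C++ side can then reconstruct it with XmlSerializer::deserialize<System>().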

A single unified core now runs both NVIDIA and AMD cards
Previously, we had two development branches, one for NVIDIA cards and one for AMD cards, which were difficult and cumbersome to debug and maintain. We couldn’t easily mix runs and gens produced by different GPU types. Now, using OpenCL, a single core supports not only AMD and NVIDIA, but in theory any OpenCL-capable device.

Improved Stability
By reducing the amount of boilerplate code, we’ve also increased the robustness and stability of the core. The log files should now be much more informative. There are also a lot of useful debugging features built right into the core to help PG developers nail down hard-to-find bugs.

Extra special thanks to our internal testers:
Jesse_V, k1wi, artoar_11, bollix, ChelseaOilman, bruce, Demonfang, Grandpa_01, EXT64, Flaschie, HayesK, jimerickson, mmonnin_, P5-133XL, Patriot, rhavern, sc0tty8, SodaAnt, SolidSteel144, EvilPenguin, art_l_j_PlanetAMD64, thews, cam51037, Pin, muziqaz, baz657, PantherX, QINSP, Schro, hootis

The Parallel Nonbonded with Cutoff Problem

One of the biggest bottlenecks in MD is the real-space non-bonded interaction of N atoms. The problem asks for the set S of all pairs (i,j) of atoms separated by a distance less than r, which will subsequently be evaluated using some force function F(i,j). Usually, |S| is much smaller than N^2, and tends to N^2 only as r grows large. This problem is trivial to solve on single-threaded CPU models, but very much non-trivial on multithreaded GPU models.

The fundamental abstract parallel operation is as follows: let A and B be two not necessarily disjoint sets of atoms of equal size; we compute |A| \cdot |B| interactions. That is, by loading O(|A|) data, we do O(|A|^2) work. This permits us to maintain an arithmetic-work-to-memory-access ratio of at least |A|.

In reality, on the GPU, we set the size of each of the sets A and B to 32 (the number of threads per warp on NVIDIA GPUs) so we don’t need to explicitly synchronize within a non-bonded calculation. Henceforth, a pair of sets (A,B) together is known as a tile, and this type of calculation is called a tiled operation. Let T denote the total set of tiles (and the pair interactions they contain). We want to find a set of tiles T whose interactions form a superset of S. That is, minimize |T| such that the interactions in T still contain all of the interactions in S.

Here are several overall approaches to the problem, illustrated below.

Consider the following pair-wise list, where (i,j) denotes that the distance between atoms i and atom j is less than r:

 (1,2) (2,9) (4,5) (4,6) (5,8) (7,8) 
               Figure 0. 

Note that we don’t actually have this list in most cases unless we explicitly construct a neighbourlist; in reality we’re only given the positions of the atoms and the cutoff radius r. In addition, instead of a warp size of 32, we use a warp size of 3 for illustration purposes.
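
To make Figure 0 concrete, here is a toy sketch of what explicitly constructing such a pair list would look like. It is brute force and O(N^2), so it is for illustration only; a real implementation has to do better (see the constraint below).

# Toy illustration only: brute-force O(N^2) construction of the pair list
# (Figure 0) from atom positions and a cutoff r.
import numpy as np

def pair_list(positions, r):
    pairs = []
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(positions[i] - positions[j]) < r:
                pairs.append((i + 1, j + 1))  # 1-based indices to match Figure 0
    return pairs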

A key constraint on all algorithms is that they do O(N \log N) work or better and ideally O(N) if possible.

1. Double Constraint (used by OpenMM):

    t A     0     1     2     (tiles)
    B     1 2 3 4 5 6 7 8 9   
      1 [ 0 1 0 0 0 0 0 0 0 ]
    0 2 [ x 0 0 0 0 0 0 0 1 ]
      3 [ x x 0 0 0 0 0 0 0 ]
      4 [ x x x 0 1 1 0 0 0 ] 
    1 5 [ x x x x 0 0 0 1 0 ]
      6 [ x x x x x 0 0 0 0 ]
      7 [ x x x x x x 0 1 0 ]
    2 8 [ x x x x x x x 0 0 ]
      9 [ x x x x x x x x 0 ] (x denotes symmetric pair)
              Figure 1. 

A_0 = \{1,2,3\}, A_1 = \{4,5,6\}, A_2 = \{7,8,9\}
B_0 = \{1,2,3\}, B_1 = \{4,5,6\}, B_2 = \{7,8,9\}
etc.
T=\{(A_0,B_0),(A_2,B_0),(A_1,B_1),(A_2,B_1),(A_2,B_2)\}

Note that (A_1,B_0) is empty so we don’t need to compute it.

The idea behind this is to identify the interacting tiles (i.e. the 3x3 blocks that contain at least one 1); a toy sketch of this identification appears after the list below. Note that identifying the actual tileset T is a non-trivial problem. Currently, OpenMM uses a method based on bounding boxes with a Hilbert-curve ordering (the reordering is done once every 100 time-steps or so) to avoid having to explicitly generate a neighbourlist (Figure 0). The bounding-box tile identification itself is done every time-step, but it is very fast. The result, though, is having to compute about 15% more tiles than needed. That is, we compare every block of WARP contiguous atoms to every other block of WARP atoms.

Advantages:
-Very little additional memory requirement (since both dimensions are fixed)
-A single Hilbert ordering persists for ~100 timesteps, so we don’t need to reorder on every timestep.
-Reads into memory are always coalesced along both dimensions.
Disadvantages:
-Computes roughly 10x more interactions than needed.
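
To make the double-constraint scheme concrete, here is a toy sketch that identifies which tiles contain at least one interacting pair, given the explicit pair list of Figure 0 and a warp size of 3. This is purely illustrative; OpenMM itself avoids the explicit pair list by using the bounding boxes and Hilbert ordering described above.

# Toy sketch: mark a tile as interacting if it contains at least one pair.
def interacting_tiles(pairs, warp=3):
    tiles = set()
    for i, j in pairs:
        a, b = (i - 1) // warp, (j - 1) // warp  # atoms are numbered from 1
        tiles.add((min(a, b), max(a, b)))        # symmetric: keep the upper triangle
    return sorted(tiles)

pairs = [(1, 2), (2, 9), (4, 5), (4, 6), (5, 8), (7, 8)]
print(interacting_tiles(pairs))  # [(0, 0), (0, 2), (1, 1), (1, 2), (2, 2)]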

2. Single Constraint (used by AMBER):

    t A     0             1             2     (tiles)
    B     1 2 3         4 5 6         7 8 9   
      1 [ 0 1 0 ]   4 [ 0 1 1 ]   5 [ 0 1 0 ]
    0 9 [ 0 1 0 ] 1 5 [ 1 0 0 ] 2 8 [ 1 0 1 ]
      3 [ 0 0 0 ]   8 [ 1 0 0 ]   9 [ 0 0 0 ]
                  Figure 2.

A_0 = \{1,2,3\}, A_1 = \{4,5,6\}, A_2 = \{7,8,9\}
B_0 = \{1,9,3\}, B_1 = \{4,5,8\}, B_2 = \{5,8,9\}
(B is unconstrained)
T = \{(A_0, B_0),(A_1,B_1), (A_2,B_2)\}

AMBER currently uses a neighbourlist with a skin to generate T.

Advantages:
-More compact: only roughly 3-4x more calculations than needed.
Disadvantages:
-Needs O(N) additional memory to store the non-constrained dimension (B).
-Only one dimension has coalesced reads.

3. No Constraints:

t A     0             1
B     1 9 5         6 7 1 
  2 [ 1 1 0 ]   4 [ 1 0 0 ]
0 4 [ 0 0 1 ] 1 8 [ 0 1 0 ]
  8 [ 0 0 1 ]   1 [ 0 0 0 ]
           Figure 3.

A_0 = \{1,9,5\}, A_1 = \{6,7,1\}
B_0 = \{2,4,8\}, B_1 = \{4,8,1\}
T = \{(A_0, B_0), (A_1,B_1)\}

Unexplored.
Advantages:
-Can probably be the most compact, providing the tightest superset of S.
Disadvantages:
-Not sure how to identify the tiles and generate (A_i, B_j).
-Needs a lot more memory (though still O(N), since we only care about the interacting tiles).

Tile Minimization: An Open Re-Ordering Problem

Suppose I have N = 9 particles, whose pairwise interactions form a symmetric 9x9 matrix, with the following interacting pairs:

    (1,2) (2,9) (4,5) (4,6) (5,8) (7,8) 

These interactions are symmetric, so the following pairs are implied as well:

    (2,1) (9,2) (5,4) (6,4) (8,5) (8,7) 

Suppose they are arranged in matrix form, where only the upper triangle is shown:

    t       0     1     2     (tiles)
      #   1 2 3 4 5 6 7 8 9   
      1 [ 0 1 0 0 0 0 0 0 0 ]
    0 2 [ x 0 0 0 0 0 0 0 1 ]
      3 [ x x 0 0 0 0 0 0 0 ]
      4 [ x x x 0 1 1 0 0 0 ] 
    1 5 [ x x x x 0 0 0 1 0 ]
      6 [ x x x x x 0 0 0 0 ]
      7 [ x x x x x x 0 1 0 ]
    2 8 [ x x x x x x x 0 0 ]
      9 [ x x x x x x x x 0 ] (x's denote symmetric pair)

I have some operation that’s computed in 3×3 tiles, and any 3×3 that contains at least a single 1 must be computed entirely. The above example requires at least 5 tiles with tile indices:

    (0,0) (0,2) (1,1) (1,2) (2,2) 

However, if I swap the 3rd and 9th columns (along with the corresponding rows, since it’s a symmetric matrix) by permuting my input:

    t       0     1     2
      #   1 2 9 4 5 6 7 8 3 
      1 [ 0 1 0 0 0 0 0 0 0 ]
    0 2 [ x 0 1 0 0 0 0 0 0 ]
      9 [ x x 0 0 0 0 0 0 0 ]
      4 [ x x x 0 1 1 0 0 0 ] 
    1 5 [ x x x x 0 0 0 1 0 ]
      6 [ x x x x x 0 0 0 0 ]
      7 [ x x x x x x 0 1 0 ]
    2 8 [ x x x x x x x 0 0 ]
      3 [ x x x x x x x x 0 ] (x's denote symmetric pair)

Now I only need to compute 4 tiles:

    (0,0) (1,1) (1,2) (2,2) 

The General Problem:

Given an NxN sparse matrix, find a re-ordering that minimizes the number of TxT tiles that must be computed. Suppose that N is a multiple of T. An optimal, but infeasible, solution can be found by trying all N! permutations of the input ordering.

What we currently use, abusing the fact that the particles exist in Euclidean space, is a Hilbert curve ordering (which can be done in linear time). Another approach is to use something like a bandwidth minimization algorithm, such as the BFS-based reverse Cuthill-McKee. Since bandwidth minimization is NP-complete, this suggests that this problem might also be NP-complete.
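
As a rough sketch of the re-ordering idea, here is a toy comparison: count the tiles required by a given ordering, then try a reverse Cuthill-McKee permutation (via scipy) and count again. This only illustrates the heuristic; it is not the Hilbert-curve scheme used in production.

# Toy sketch: count TxT tiles needed by an ordering, then try an RCM reordering.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def count_tiles(adj, T):
    tiles = set()
    for i, j in zip(*np.nonzero(adj)):
        if i <= j:
            tiles.add((i // T, j // T))
    return len(tiles)

# The 9x9 example above, with atoms renumbered 0..8 and tile size T = 3.
pairs = [(0, 1), (1, 8), (3, 4), (3, 5), (4, 7), (6, 7)]
adj = np.zeros((9, 9), dtype=int)
for i, j in pairs:
    adj[i, j] = adj[j, i] = 1

print(count_tiles(adj, 3))  # 5 tiles with the original ordering

perm = reverse_cuthill_mckee(csr_matrix(adj), symmetric_mode=True)
reordered = adj[np.ix_(perm, perm)]
print(count_tiles(reordered, 3))  # typically fewer tiles after reordering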


(Figures omitted: example orderings produced by a Hilbert curve, METIS nested dissection, reverse Cuthill-McKee, and approximate minimum degree ordering.)

@Stanford

I’m currently in PA writing some Python wrappers for the RMSD code. Two functions will soon be made available: all-against-one RMSD and all-against-one LPRMSD. Additional features will soon include explicit recovery of the Euler rotation matrices. Wrapped in 5 hours using SWIG. It is currently being integrated into the MSMBuilder code with the help of Kyle Beauchamp, Robert McGibbon, Doug Theobald, and Peter Eastman.

A separate public version will also soon be made available for all you Python hackers.

Alpha Preview!

I’m currently working on an algorithm that can robustly fit periodic data. As shown below, the histogram is formed from the observed sample data, and the line denotes the fit obtained using the algorithm. The GUI was hacked together in Qt in less than 6 hours.

Shoot me an e-mail if you’d like to test it out for your problem.