Stuart Kauffman wears many hats. He is an entrepreneur who founded Bios Group, a Santa Fe-based software company where he is now chief scientific officer and chairman of the board, and co-founded Cistem Molecular in San Diego. He is an academic, with current posts as an external professor at the Santa Fe Institute and a professor emeritus at the University of Pennsylvania School of Medicine. And he is an author, having written numerous papers and three popular books (Origins of Order: Self-Organization and Selection in Evolution, At Home in the Universe and Investigations). But perhaps most of all, he is a visionary.
Indeed, Kauffman is among the pioneering scientists now mapping the intersection of computational mathematics and molecular biology--a burgeoning field known as bioinformatics. Ken Howard, a freelance writer based in New York City, recently sat down with Kauffman for Scientific American to discuss this relatively new discipline. An edited transcript of their conversation follows. Additional information on bioinformatics and the genome business in general will appear in the July issue of Scientific American.
SA: What is the promise of bioinformatics?
The completion of the human genome project itself is a marvelous milestone, and it's the starting gate for what will become the outstanding problem for the next 15 to 20 years in biology--the postgenomic era that will require lots of bioinformatics. It has to do with how we understand the integrated behavior of 100,000 genes, turning one another on and off in cells and between cells, plus the cell signaling networks within and between cells. We are confronting for the first time the problem of integrating our knowledge.
SA: How so?
The fertilized egg has somewhere between 80,000 and 100,000 structural genes. I guess we'll know pretty quickly what the actual answer is. We're entitled to think of the, let's say, 100,000 genes in a cell as some kind of parallel processing chemical computer in which genes are continuously turning one another on and off in some vastly complex network of interaction. Cell signaling pathways are linked to genetic regulatory pathways in ways we're just beginning to unscramble. In order to understand this, molecular biologists are going to have to -- and they are beginning to -- change the way they think about cells. We have been thinking one gene, one protein for a long time, and we've been thinking very simple ideas about regulatory cascades called developmental pathways.
The idea is that when cells are undergoing differentiation, they are following some developmental pathway from a precursor cell to a final differentiated cell. And it's true that there's some sort of pathway being followed, but the relationship between that and the rolling change of gene activities is far from clear. That's the huge problem that's confronting us. So the most enormous bioinformatics project that will be in front of us is unscrambling this regulatory network. And it's not going to merely be bioinformatics; there has to be a marriage between new kinds of mathematical tools to solve this problem. Those tools will in general suggest plausible alternative circuits for bits and pieces of the regulatory network. And then we're going to have to marry that with new kinds of experiments to work out what the circuitry in cells actually is.
SA: Who is going to be working on this entire rubric? Is it bioinformaticians, or is it mathematicians or biologists?
All of the above. As biologists become aware of the fact that this is going to be essential, they are beginning to turn to computational and mathematical techniques to begin to look at it. And meanwhile we have in front of us the RNA chip data that's becoming available and proteomics as well. An RNA chip shows the relative abundance of transcripts from a very large number of different genes that have been placed on the chip from a given cell type or a given tissue sample. There are beginning to be very large databases of RNA chips that have expression data for tens of thousands of genes, for normal cells, diseased cells, and untreated and treated normal and diseased cells. Most of the data is a single-moment snapshot. You just sample some tissue and see what it is doing. But we're beginning to get data for developmental pathways.
So you have a precursor cell that's triggered to differentiate by giving it some hormone. And you take staged samples of the thing as it differentiates, and we watch genes wax and wane in their activities during the course of that differentiation.
That poses a problem of what you do with all that data. Right now what people are doing largely is cluster analysis, which is to say they take the chip data from a bunch of different cell types and try to cluster the genes that are expressed to a high level versus those that are expressed to a low level. And in effect that's simply a different way of recognizing cell types, but now at the molecular level. And there's nothing wrong with doing that, but there's nothing functional about it. If you're interested in finding out that gene A turns on gene B, and gene B -- when it's on -- turns off gene C, the way people are going about analyzing it doesn't lead in the direction of answering those questions.
SA: Why is that?
Because they're just recognizing patterns. Which is useful for diagnostic purposes and treatment purposes, but it's not the way to use the data to find what the regulatory circuits are.
SA: What could you do once you discover the circuitry?
First of all, you've just broadened the target range for the drug industry. Suppose a given gene makes an enzyme, then perhaps the enzyme's a nicely druggable target and you can make a molecule that enhances or inhibits the activity of the enzyme. But something you could do instead would be to turn on or off the gene that makes the enzyme. By finding the circuitry in and around a gene of medical interest, you've just expanded the number of druggable targets, so that you can try to modulate the activity of the genetic network rather than impinging upon the product of the gene.
Also, anything along the lines of diagnostics, if I know patterns of gene activities and regulatory circuitries, I can test to see the difference between a normal cell type and hepatic cancer cell -- a liver cancer cell. That is obviously useful diagnostically and therapeutically.
The biggest and longest-term consequence of all of this is uncovering the genetic regulatory network that controls cell development from the fertilized egg to the adult. That means that in the long run, we're going to be able to control cell differentiation and induce cell death, apoptosis. My dream is the following: 10 or 20 years from now, if you have prostatic cancer, we will be able to give drugs that will induce the cancer cells to differentiate in such a way that they will no longer behave in a malignant fashion, or they'll commit suicide by going into apoptosis. Then we'll also be able to cause tissue regeneration so that if you happen to have lost half of your pancreas, we'll be able to regenerate your pancreas. Or we'll be able to regenerate the beta cell islets in people who have diabetes. After all, if we can clone Dolly the sheep and make a whole sheep from a single cell, and if we now have embryonic stem cells, what we need are the chemical inductive stimuli that can control pathways of differentiation so that we can cause tissue to regenerate much more at will than we can now. I think that's going to be a huge transformation in biomedicine.
SA: What part does bioinformatics play in achieving this?
Most cancer cells are monoclonal; that means they're all derived from some single cell. And most cancer cells when they differentiate are leaky in the sense that they give rise to normal cell types as well as to cancer cells. This is a fact known to oncologists, which is not part of our therapeutic regimen right now. Cancer cells give rise to both normal and cancer cell types when the cancer stem cell, which is maintaining the monoclonal line, is proliferating. What if we could take that cancer cell and give it chemical signals that induce it to differentiate into normal cell types? We would be treating the cancer cell not by killing cells, but by using jujitsu on them and diverting them to be normal cell types. This already works with vitamin A and certain cancer cells, so there's already a precedent for it.
This is just one example of what we'll be able to do as we discover the circuitry and the logic -- and therefore the dynamical behavior -- of cells. We will know which gene we have to perturb with what, or which sequences of genes we have to perturb in what temporal order, to guide the differentiation of a cancer cell to nonmalignant behavior or to apoptosis. Or to guide the regeneration of some tissue. I can imagine the time when we'll be able to regenerate cardiac tissue from surrounding normal tissue instead of having a scar in place. The scar serves as the focus for getting recurrent electrical activity in the heart, sending up little spirals of electrical activity, which make your heartbeat unstable and which make people who have had a heart attack subject to sudden cardiac death because they go into ventricular fibrillation.
Suppose what we could do instead of getting scar tissue, suppose we could get those cells to differentiate into perfectly normal myofibrils. Nothing says we can't do that since the muscle cells and fibrotic tissue are cousins of one another developmentally. So you could begin to imagine treating all kinds of diseases by controlling cell differentiation, tissue differentiation and so on. And to do that we're going to have to know what the circuitry is, and we're going to have to know what small molecules or molecules in general can be added to a person that will specifically treat the diseased tissues and not have undue side effects.
SA: How does complexity theory, disorganization/self-organizing systems, come into play? How do computers and algorithms and data from many different places need to be integrated?
There are three ways we will understand genetic regulatory networks, all of which involve computational work, as well as piles of data. One of them has already been pioneered. It is the following: I have a small genetic circuit, for example, bacteriophage lambda -- or something like that, which has 20 or 30 genes in it and one major switch. And I know all of the genes; I know which genes make which products, which bind to which genes; I know the binding constants by which those gene products bind to the gene. And what I do is, I make in effect an engineering model of that specific circuit, sort of like electrical engineering, except it's molecular-biology-chemical engineering to make a specific circuit for that phage. One is inclined to do the same thing with the human genome.
Suppose I pick out 10 genes that I know regulate one another. And I try to build a circuit about their behavior. It's a perfectly fine thing, and we should do it. But the downside is the following: those 10 genes have inputs from other genes outside that circuit. So you're taking a little chunk of the circuitry that's embedded in a much larger circuit with thousands of genes in it. You're trying to figure out the behavior of that circuit when you do not know the outside genes that impact it. And that makes that direct approach hard because you never know what the other inputs are. Evidence that it's hard comes from a parallel, looking at neural circuits, at neural ganglia. We've known for years what every neuron is in, say, the lobster gastric ganglia; what all of the synaptic connections are; what the neurotransmitters are; and you have maybe 13 or 20 neurons in the ganglion, and you still can't figure out the behavior of the ganglion. So no mathematician would ever think that understanding a system with 13 variables is going to be an easy thing to do. And we want to do it with 100,000 variables. That scales the problem.
Molecular biologists have thought they're going to be able to work out how 100,000 genes work with one another without having to write down the mathematical equations by which genes govern one another, and then figuring out from that what the behavior is. That's why this is a stunning transition that we're going through, and there's a lot of stumbling around going on. It's because molecular biologists don't know any mathematics, by and large.
The second approach to this problem is one that I've pioneered, and it's got great strengths but great weaknesses. It turns out that if you want to model genes as if they're little lightbulbs, which they're not, but if you want to model them that way, then the appropriate rules that talk about turning genes on and off are called Boolean functions. If you have K inputs to a gene, there's only a finite number of different Boolean functions.
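To make the counting behind that last remark concrete: a gene with K on/off inputs has a truth table of 2^K rows, and each row can output either on or off, so there are 2^(2^K) possible rules. The sketch below enumerates them for small K. The function names are hypothetical, chosen only for this illustration.

```python
from itertools import product


def count_boolean_functions(k: int) -> int:
    """Number of distinct Boolean rules for a gene with k on/off inputs."""
    return 2 ** (2 ** k)


def enumerate_boolean_functions(k: int):
    """Yield every Boolean rule of k inputs as a truth table.

    Each truth table is a dict mapping an input combination (a tuple of
    0/1 values) to the gene's output (0 or 1).
    """
    input_rows = list(product((0, 1), repeat=k))
    for outputs in product((0, 1), repeat=len(input_rows)):
        yield dict(zip(input_rows, outputs))


for k in range(1, 5):
    print(f"K={k}: {count_boolean_functions(k)} possible rules")
```

The count grows very quickly with K, which is one reason the number of inputs per gene matters so much in these models.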
So what I started doing years ago was wondering: if you just made genetic networks with the connections assigned at random and with the logic that each gene follows assigned at random, would there be a class of networks that just behaved in ways that looked like real biological networks? Are there classes of networks where all I do is tell you all I know about the number of inputs, and some biases on the rules by which genes regulate one another -- and it turns out that a whole class of networks, regardless of the details, behaves with the kind of order that you see in normal development?
SA: How would you identify the different networks as one class or another?
There are two classes of networks: one class behaves in an ordered regime and the other class behaves in a chaotic regime, and then there is a phase transition, dubbed the edge of chaos, between the two regimes. The ordered regime shows lots and lots of parallels to the behavior of real genetic systems undergoing real development. To give an example, if I make a network with two inputs per gene and that's all I tell you, and I make a huge network, with 50,000 or 100,000 genes, and everybody's got two inputs, but it's a scrambled mess of a network in terms of the connections -- it's a scrambled spaghetti network, and the logic assigned to every gene is assigned at random, so the logic is completely scrambled -- that whole system nevertheless behaves with extraordinary order. And the order is deeply parallel to the behavior of real cell types.
Even with 10 inputs per gene, networks pass from the chaotic regime into the ordered regime if you bias the rules with canalizing functions. The data is very good that genes are regulated by canalizing functions. There is one caveat. It could be that among the known genes that are published, it's predominantly the case that they are governed by canalizing functions because such genes have phenotypic effects that are easy to find, and there are lots of genes governed by noncanalizing functions, but you just can't find them easily genetically. So one of the things that we'll have to do is take random genes out of the human genome or the yeast genome or the fly genome and see what kind of control rules govern them.
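For readers who want to see what a canalizing rule looks like in practice, here is a minimal sketch. It assumes the usual definition, consistent with the description later in this interview: a rule is canalizing if at least one input has a value that fixes the gene's output no matter what the other inputs do. Constant rules count as canalizing under this definition. The helper names are hypothetical.

```python
from itertools import product


def all_truth_tables(k: int):
    """Yield every Boolean rule of k inputs as a dict: input tuple -> output."""
    rows = list(product((0, 1), repeat=k))
    for outputs in product((0, 1), repeat=len(rows)):
        yield dict(zip(rows, outputs))


def is_canalizing(table: dict, k: int) -> bool:
    """True if some input, held at some value, forces a single output."""
    for i in range(k):
        for value in (0, 1):
            outputs = {out for row, out in table.items() if row[i] == value}
            if len(outputs) == 1:
                return True
    return False


def count_canalizing(k: int):
    """Return (number of canalizing rules, total number of rules) for k inputs."""
    tables = list(all_truth_tables(k))
    return sum(is_canalizing(t, k) for t in tables), len(tables)


# OR is canalizing (either input being on forces the output on); XOR is not.
OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
print(is_canalizing(OR, 2), is_canalizing(XOR, 2))  # True False
print(count_canalizing(2))  # (14, 16)
```

With two inputs almost every rule is canalizing, but the fraction falls as the number of inputs grows, which is why a bias toward canalizing rules is a meaningful constraint for genes with many inputs.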
SA: What would be the implication if it did turn out that most were governed by canalizing functions?
What we've done is made large networks of genes, modeled genes mathematically, in which we've biased the control rules to ask whether or not such networks are in the ordered or chaotic regime, and they are measurably in the ordered regime. The implication is that natural selection has tuned the fractions of genes governed by canalizing functions such that cells are in the ordered regime.
The way you do this is you make an ensemble of all possible networks with the known biases, number of inputs per gene and biases on the rules. And you sample thousands of networks drawn at random from that ensemble, and the conclusions that you draw are conclusions for typical members of the ensemble. This is a very unusual pattern of reasoning in biology. It's precisely the pattern of reasoning used for things like spin glasses, which are disordered magnetic materials. The preeminent place it has been used is in statistical physics. The weakness of this ensemble approach, this ensemble of networks, is that you can never deduce from it that gene A regulates gene F because you're making statistical models. The strength is that you can deduce lots of things about genetic regulatory nets that you can't get to if you make little circuits and try to do the electrical engineering approach.
In the simplest case in these model networks, there's a little central clock that ticks; that's not true for real cells, but it will do for right now. Every gene looks at the state of its inputs and it does the right thing. So let me define the state of the network as the current on and off values of all 100,000 genes. So how many states are there? Well, there are two possibilities for gene 1 and two possibilities for gene 2 and so on, so there's 2^100,000, which is 10^30,000, so we're talking about a system in the human genome, even if we treat genes as idealized as on or off -- which is false because they show graded levels of activity -- it's got 10^30,000 possible states. It is mind-boggling because the number of particles in the known universe is 10^80.
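The size of that number can be checked with a line of arithmetic: the base-10 exponent of 2^N is N times log10(2). A minimal sketch, with a hypothetical function name:

```python
import math


def log10_number_of_states(n_genes: int) -> float:
    """Base-10 exponent of 2**n_genes, the number of on/off network states."""
    return n_genes * math.log10(2)


exponent = log10_number_of_states(100_000)
print(f"2^100,000 is about 10^{exponent:,.0f}")
```

The exponent comes out just above 30,100, so "10 to the 30,000" is a fair round figure.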
Here's what happens in the ordered regime. At any moment in time, the system is in a state, and there's a little clock; when the clock ticks, all the genes look at the states of their inputs and they do the right thing, so the whole system goes from some state to some state, and then it goes from state to state along a trajectory. There's a finite number of states; it's large, but it's finite. So eventually the system has to hit a state it's been in before, and then it's a deterministic system; it'll do the same thing. So it will now go around a cycle of states. So the generic behavior is a transient that flows into a cycle.
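The argument above (finite state space, deterministic clocked update, so every trajectory must fall onto a cycle) can be sketched directly. The code below is an illustration under stated assumptions, not the author's own software: genes are 0/1 values, every gene updates at the same tick, and each gene's rule is a lookup table indexed by its inputs. All function names are hypothetical.

```python
import random


def make_random_network(n_genes: int, k: int, rng: random.Random):
    """Wire each gene to k randomly chosen inputs and give it a random rule."""
    inputs = [[rng.randrange(n_genes) for _ in range(k)] for _ in range(n_genes)]
    tables = [[rng.randint(0, 1) for _ in range(2 ** k)] for _ in range(n_genes)]
    return inputs, tables


def step(state, inputs, tables):
    """One tick of the clock: every gene reads its inputs and applies its rule."""
    next_state = []
    for gene_inputs, table in zip(inputs, tables):
        row = 0
        for j in gene_inputs:
            row = row * 2 + state[j]  # pack the input values into a table index
        next_state.append(table[row])
    return tuple(next_state)


def transient_and_cycle(state, inputs, tables):
    """Follow the trajectory until a state repeats.

    Returns (transient length, cycle length).
    """
    first_seen = {}
    t = 0
    while state not in first_seen:
        first_seen[state] = t
        state = step(state, inputs, tables)
        t += 1
    return first_seen[state], t - first_seen[state]


rng = random.Random(0)
inputs, tables = make_random_network(n_genes=12, k=2, rng=rng)
start = tuple(rng.randint(0, 1) for _ in range(12))
print(transient_and_cycle(start, inputs, tables))
```

The loop is guaranteed to stop because a network of n genes has only 2^n states, so some state must recur within 2^n ticks.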
The cycle is called a state cycle or an attractor. What I did 30 years ago was ask, "What's a cell type?" And I guessed that cell types were attractors, because otherwise we'd have 10^30,000 different cell types, and we have something like 260. So here's what happens in the ordered regime. The number of states on the state cycle tends to be about the square root of the number of genes. The square root of 100,000 is around 316. So this system with 10^30,000 states settles down to a little cycle with 300 states on it. That's enormous order. The system has squeezed itself down to a tiny black hole in its state space.
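The ensemble style of reasoning described earlier can be sketched as a small sampling experiment: draw many random networks with a fixed number of inputs per gene, measure the cycle each one settles onto, and summarize the distribution. This is a toy version with assumed parameters (12 genes, 200 samples) and hypothetical function names; it makes no claim to reproduce the square-root scaling, which concerns far larger networks.

```python
import random
import statistics


def random_network(n_genes, k, rng):
    """Random wiring and random Boolean lookup tables, k inputs per gene."""
    inputs = [[rng.randrange(n_genes) for _ in range(k)] for _ in range(n_genes)]
    tables = [[rng.randint(0, 1) for _ in range(2 ** k)] for _ in range(n_genes)]
    return inputs, tables


def step(state, inputs, tables):
    """Synchronous update: every gene applies its rule to its current inputs."""
    nxt = []
    for gene_inputs, table in zip(inputs, tables):
        row = 0
        for j in gene_inputs:
            row = row * 2 + state[j]
        nxt.append(table[row])
    return tuple(nxt)


def cycle_length(state, inputs, tables):
    """Length of the state cycle reached from the given starting state."""
    first_seen = {}
    t = 0
    while state not in first_seen:
        first_seen[state] = t
        state = step(state, inputs, tables)
        t += 1
    return t - first_seen[state]


def sample_cycle_lengths(n_genes, k, samples, seed=0):
    """Draw `samples` networks from the ensemble; one random start state each."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(samples):
        inputs, tables = random_network(n_genes, k, rng)
        start = tuple(rng.randint(0, 1) for _ in range(n_genes))
        lengths.append(cycle_length(start, inputs, tables))
    return lengths


lengths = sample_cycle_lengths(n_genes=12, k=2, samples=200)
print("median cycle length:", statistics.median(lengths))
print("longest cycle seen:", max(lengths), "out of", 2 ** 12, "possible states")
```

The conclusions one draws from such a run are about typical members of the ensemble, not about any particular network, which is exactly the strength and the weakness of the approach.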
If you're on one of these attractor state cycles and you perturb the activity of a single gene, like if a hormone comes in, most of the time you come back to the same attractor, so you have homeostasis. Sometimes, however, you leave one state cycle and you jump onto a transient that goes onto another state cycle, so that's differentiation.
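A toy example may help fix the distinction. The three-gene network below is entirely hypothetical: genes 0 and 1 repress each other, and gene 2 simply follows gene 0. Starting from a stable state, the sketch flips each gene in turn and reports whether the system returns to the same attractor (homeostasis) or lands on a different one (differentiation, in the sense used here).

```python
def toy_step(state):
    """Hypothetical 3-gene circuit: g0 and g1 repress each other, g2 copies g0."""
    g0, g1, g2 = state
    return (1 - g1, 1 - g0, g0)


def attractor_of(state, step_fn):
    """Follow the trajectory and return the set of states on its final cycle."""
    first_seen = {}
    order = []
    while state not in first_seen:
        first_seen[state] = len(order)
        order.append(state)
        state = step_fn(state)
    return frozenset(order[first_seen[state]:])


def flip(state, gene):
    """Perturb one gene: turn it off if it is on, on if it is off."""
    flipped = list(state)
    flipped[gene] ^= 1
    return tuple(flipped)


def classify_perturbations(state_on_attractor, step_fn):
    """For each single-gene flip, report homeostasis or differentiation."""
    home = attractor_of(state_on_attractor, step_fn)
    outcome = {}
    for gene in range(len(state_on_attractor)):
        landed = attractor_of(flip(state_on_attractor, gene), step_fn)
        outcome[gene] = "homeostasis" if landed == home else "differentiation"
    return outcome


# (1, 0, 1) is a fixed point of the toy circuit.
print(classify_perturbations((1, 0, 1), toy_step))
```

In this toy circuit, flipping the downstream gene 2 is absorbed, while flipping either gene of the mutually repressing pair pushes the system onto a different cycle.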
All the things I'm telling you are testable things about the human genome. And they are predictions about the integrated behavior of the whole genome, and there's no way of getting to that right now without using the ensemble approach. They're very powerful predictions. That's the strength of it. The weakness is it doesn't tell you that gene A regulates gene F, which of course is exactly one of the things that we want to know.
SA: What does it buy you?
There are all sorts of questions you can answer using the ensemble approach that we will not otherwise be able to answer until we have a complete theory of the genome.
SA: What then is the next step to take? Take the data from the genome and plug it into these models?
Yes, in the sense that you can do experiments to test all of the predictions of these kinds of ensemble models. Nothing prevents me from cloning in a controllable promoter upstream from 100 different randomly chosen genes in different cell lines, perturbing the activity of the adjacent gene and using Affymetrix chips to look at the avalanches of changes of gene activity. All of that is open to be tested.
We should be able to predict not only that it happens but the statistical distribution of how often when you do it cell type A goes back to being cell type A and how often cell type A becomes cell type B. Everything here is testable. In the actual testing of it for real cells, we'll begin to discover which perturbation to which gene actually causes which pathway of differentiation to happen. You can use molecular diversity or combinatorial chemistry to make the molecules with which you do the perturbation of cells and then test the hypotheses we've talked about.
The aim is to try and find means to either change the abundance of a given gene transcript to treat a disease or to cause differentiation. I have more than one friend who either had or has cancer and our methods for treating cancer are blunderbuss, really idiotic, even though they are much more sophisticated than they used to be. We're just killing dividing cells. What if we could get to where we could direct cells to differentiate? It's huge in its practical importance if we could make that happen.
SA: Is bioinformatics the tool to integrate the computational work and the wet work?
Bioinformatics has to be expanded to include experimental design. What we're going to get out of each of these pieces of bioinformatics is hypotheses that now need to be tested. And it helps you pick out what hypothesis to go test. And the reason is we don't know all 100,000 genes and the entire circuitry. Even if we knew the entire circuitry, as we do for ganglia in the lobster gut -- people have been working for 30 years to understand how the lobster gut ganglia work, even knowing all the anatomical connections. So it isn't going to be easy.
I think the greatest intellectual growth area will come with the inverse problem. The point is the following: I show you the Affymetrix chips of differing patterns of gene expression and you tell me from that data what gene actually regulates what gene by what logic. That's the inverse problem. I show you the behavior of the system and you deduce the logic and the connections among the genes.
SA: Do you see being able to do in silico work for the entire human body or various circuits anytime soon, or ever?
Yes, I do. I think our timescale is 10 to 15 years to develop good models of the circuitry in cells because so much of the circuitry is unknown. But before I can make the model and explore the dynamical behavior of the model, I can either use the ensemble approach, which I've used, or I actually have to know what the circuitry is. There are three approaches for discovering the circuitry. One is purely experimental, which is what molecular biologists have been doing for a long time.
SA: How accurate is it? How testable is it?
It works for small networks, for synchronous lightbulb networks. Real cells aren't lightbulbs; they're graded levels of activity and they're not synchronous, so it's a much harder problem to try and do this for real cells for a variety of reasons. First of all, when you take a tissue sample, you don't have a single cell type, you typically have several cell types in it. A lot of the data that's around has to do with tissue samples because it's from biopsy data. Second, most of it is single-moment snapshots rather than a sequence of gene activities along a developmental pathway. That's harder data to get.
It's beginning to become available. It is from the state transitions that we can learn an awful lot about which genes regulate which genes. There are other potentially powerful techniques that amount to looking at correlated fluctuations in patterns of gene activity and trying to work out from those which genes regulate which genes by what rules. The bottom line of such inverse problem efforts is that the algorithms are going to come up with alternative possible circuits that would explain the data. And that then is going to guide you to ask what's the right next experiment to do to reduce your uncertainty about what the right circuit is. So the inverse problem is going to play into the development of experimental design.
SA: You start with microarray experimentation?
You go in saying, "We think gene A regulates gene F." And now you say, "If that's true, if I perturb the activity of gene A, I should see a change in the activity of gene F; specifically if I turn gene A on, I should turn gene F on." So now you go back and find a cis site and a trans factor, or a small molecule that when added to the cell will turn gene A on, and then you use an Affymetrix chip or some analogue of that to say, "Did I just turn on gene F?"
SA: What is the practical application?
We've just expanded the set of druggable targets from the enzyme itself to the circuitry that controls the activity of the gene that makes the enzyme.
SA: What is bioinformatics' role in this entire enterprise?
Let's take the inverse problem. The amount of data that one is going to need from real fluctuating patterns of gene expression and the attempt to deduce from that which gene regulates which gene by what logic -- that's going to require enormous computational power. I thought the problem was going to be what's called NP-hard, namely exponentially hard, in the number of genes. I now think it's not; I think it's polynomially hard, which means it's solvable or it's much more likely to be solvable in the number of genes.
The real reason is the following: Suppose that any given gene has a maximum of somewhere between one and 10 inputs. Those inputs, if you think of them as just on or off, can be in 2^10 states; they can all be on or can all be off or any other combination. Well, 2^10 is about 1,000. That's a pretty big number. But it's small compared to 10^30,000. Since most genes are regulated by a modest number of other factors, the problem is exponential in the number of inputs per gene, but only polynomial in the number of genes. So we have a real chance at cracking the inverse problem. I think it's going to be one of the most important things that will happen in the next 15 years.
SA: Is the inverse problem a true problem or one method to get information?
It's a true problem. The direct problem is that I write down the equations for some dynamical system and it then goes and behaves in its dynamical way. The inverse problem is you look at the dynamical behavior and you try and figure out the laws. The inverse problem for us is we see the dynamical behavior of the genome displaying itself on Affymetrix chips or proteomic display methods, like 2-D gels. And now we want to deduce from that what the circuitry and the logic is. So it's the general form of a problem. It's the way to try and get out which genes are regulating which genes, so that I know not just from my ensemble approach, but I know that gene A really is regulating gene F.
SA: What are the barriers to figuring out the inverse problem? Is it computer power? Designing the proper algorithms? Biomolecular understanding?
All three. Let's take a case in point: Feedback loops make it hard to figure things out. And the genome is almost certainly full of feedback loops. For example, there are plenty of genes that regulate their own activity. So figuring out the algorithms that will deal with feedback loops is not going to be trivial. The computing power gets to be large because if I want to look for a single input, like a canalizing input, the canalizing input is really easy to tell because if gene A is on, then gene C is on no matter what. So all I have to do is examine a lot of gene expression data and I can see whenever A is on, now C is on a moment later.
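The one-gene-at-a-time scan described here can be sketched as follows. The sketch assumes idealized data: a time series of on/off snapshots taken at successive clock ticks, with no noise. It reports every ordered pair (A, C) for which C was on one tick after every observed instance of A being on. The function name and the tiny data set are hypothetical, and a reported pair is only a candidate consistent with the data, not proof of regulation.

```python
def canalizing_input_candidates(series, min_observations=1):
    """Find pairs (a, c) where gene c is on one tick after every time a is on.

    `series` is a list of equal-length tuples of 0/1 values, one per time point.
    Pairs supported by fewer than `min_observations` instances are skipped.
    """
    n_genes = len(series[0])
    candidates = []
    for a in range(n_genes):
        # Time points where gene a is on and a following snapshot exists.
        on_times = [t for t in range(len(series) - 1) if series[t][a] == 1]
        if len(on_times) < min_observations:
            continue
        for c in range(n_genes):
            if all(series[t + 1][c] == 1 for t in on_times):
                candidates.append((a, c))
    return candidates


# Five snapshots of three genes (index 0 = A, 1 = B, 2 = C); made-up data.
snapshots = [
    (1, 0, 0),
    (0, 1, 1),
    (1, 1, 0),
    (0, 0, 1),
    (0, 0, 0),
]
print(canalizing_input_candidates(snapshots))  # [(0, 2)]
```

The scan costs on the order of the number of genes squared times the number of snapshots, which is what makes single-input rules the easy case. Raising `min_observations` is one simple guard against pairs that fit only by coincidence.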
I can do that by looking at things one gene at a time. But suppose I had a more complicated rule in which two genes had to be in a combined state activity to turn on gene C. To do that, I have to look at genes pair-wise to see that they manage to regulate gene C. If I looked at A alone or B alone, I wouldn't learn anything. So now if I have 100,000 genes, I've got to look at 100,000^2 pairs; that's 10^10 pairs. Now what if I have a rule that depends on the state of activity of three genes to turn the gene on, then I have to look at (10^5)^3, which is 10^15, and that's probably about the limit of the computing power that we've got now.
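The growth in the search can be tabulated directly. The sketch below counts candidate input combinations for a rule with k inputs among n genes, both as ordered tuples (n^k, the figure used above) and as unordered sets (n choose k), which is smaller by roughly a factor of k factorial. The function names are hypothetical.

```python
import math


def ordered_tuples(n_genes: int, k: int) -> int:
    """Candidate input combinations counted as ordered k-tuples: n**k."""
    return n_genes ** k


def unordered_sets(n_genes: int, k: int) -> int:
    """Candidate input combinations counted as unordered sets: n choose k."""
    return math.comb(n_genes, k)


n = 100_000
for k in (1, 2, 3):
    print(f"k={k}: {ordered_tuples(n, k):.1e} ordered, {unordered_sets(n, k):.1e} unordered")
```

Either way the count is polynomial in the number of genes for a fixed k, and it is the exponent k, the number of inputs per rule, that drives the cost.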
But that leaves out the fact that we don't have to be stubborn about it; we can always go do an experiment. And so this now ties to experimental design. Notice that all of these problems lead in the direction of new experimental designs, and what we're going to have to do is to marry things like the inverse problem to being able to toggle the activity of arbitrary genes.
SA: What about the molecular biology knowledge?
It's going to take a lot of biological knowledge. For example, let's suppose that every structural gene has at least one cis site that regulates it. Then there's 100,000 cis sites. Nobody knows if that's true, but let's pose that. Well, we have an awful lot of work to do to pull out all the cis sites. Now let's suppose that 5 percent of the structural genes that are around act as regulatory inputs to the structural genes. So there's on the order of 4,000 to 5,000 trans factors making up this vast network that we're talking about.
Well, we have to discover what those trans factors are; we have to discover what the cis sites are; we have to discover what the regulatory logic is. Then we have to make mathematical models of it. Then we have to integrate the behavior of those mathematical models. And then we're going to run into the same problems that people have looking at the gut ganglion in lobster -- that even though you know all the inputs, figuring out the behavior is going to be hard. Then we're going to run into the problem that you're looking at a circuitry with 40 genes in it, but there are impacts coming in from other genes in the 100,000-gene network that are going to screw up your models. So, this ain't going to be easy.
SA: In terms of a timeline, are we looking at one year or 200 years away?
I think 30 to 40 years from now we will have solved major chunks of this. The tools will mature in the next 10 to 12 years, and then we'll really start making progress.
SA: What do you define as progress?
That we will be getting the circuitry for big chunks of the genome and actually understanding how it works. Getting to the genomic sequences is wonderful, but what does it tell you about the circuitry? So far, nothing -- except who the players are.
SA: So we're at the beginning?
We're at the very beginning of postgenomic medicine. But the payoff is going to be enormous. There's going to be a day 30 years from now where somebody comes in with cancer and we diagnose it with accuracy not just on the morphology of the cancer cell but by looking at the detailed patterns of gene expression and cis site binding activities in that cell. And we know the circuitry and the autocrine perturbations to try, or we know which gene activity perturbations to try that will cause that cell to differentiate into a normal cell type or cause that cell to commit hara-kiri.
SA: Will that be someone walking into their doctor's office, the doctor turning on the computer and just entering the data?
It will require being able to do the RNA sample. Biotech companies together with big pharma, which alone has the money to get things through clinical trials, will wind up proving that you can treat cancer this way -- or treating some degenerative disease of your joint, for example, where we regenerate the synovium. Why not? We can make an entire sheep -- why can't we regenerate the synovium?