Tuesday, September 22, 2009

Biocomputers: Fault Tolerant Systems

An underappreciated distinction between software engineering and every other field of engineering: in software, the blueprints are the final product.

If I build a car, a scanner, a hard drive, anything that has a physical manifestation, then I run the risk of a manufacturing defect.  Most devices go bad after a time.  However, the blueprints do not.  In most ways, this is an advantage for software: it is limited in the ways it can decay.  Software defects are limited to their blueprints (code) and their operating environment.

In one way, however, this gives hardware an advantage: fault tolerance through duplication.  If you have two of any given device, chances are that at least one of them will behave as planned.  And if you can identify when hardware behaves unexpectedly, then you can replace it before it affects the system.  RAID works for disks.  It does not work for software.
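
To put rough numbers on that, here is a minimal sketch of the arithmetic.  The 10% failure rate and the function names are made up for illustration, and the hardware case assumes the copies fail independently of each other, which real devices only approximate.

```python
def all_copies_fail_independent(p_fail, copies):
    # Hardware-style duplication: each copy fails on its own,
    # so every copy must fail for the system to fail.
    return p_fail ** copies


def all_copies_fail_shared_defect(p_fail, copies):
    # Software-style duplication: every copy carries the same blueprint,
    # so a defect in the blueprint hits all copies at once.
    return p_fail


# Hypothetical 10% chance that a single unit misbehaves.
print(all_copies_fail_independent(0.1, 2))
print(all_copies_fail_shared_defect(0.1, 100))
```

Two independent disks take the chance of total failure from roughly one in ten to roughly one in a hundred; a hundred copies of the same flawed logic leave it exactly where it started.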

By conventional design, software is not fault tolerant.  You can spawn a hundred copies of your thread, but if your logic is incorrect, chances are that all hundred will fail.  Attempts have been made to build fault-tolerant software before.  The Tandem/16, running the Guardian operating system, would keep backup processes ready to take over if primary processes failed, restoring a checkpoint and running from there.  While this mechanism works for some types of faults, in reality it is transaction management, not a fully functional backup.  Again, a logic error in one thread would simply be repeated in the backup.
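
A toy sketch of that limitation follows.  It is not how Guardian actually worked; `run_with_backup`, `buggy_average`, and the simulated crash are all hypothetical names invented for illustration.  The backup recovers from the crash without trouble, then faithfully reproduces the logic error.

```python
def buggy_average(values):
    # Deliberate logic error: divides by a fixed 2 instead of len(values).
    return sum(values) / 2


def run_with_backup(task, checkpoint, transient_faults=0):
    """Run task from a checkpoint; after a crash, a backup restarts from the same checkpoint."""
    attempts = 0
    while True:
        attempts += 1
        try:
            if attempts <= transient_faults:
                # Stand-in for a transient fault such as a process crash.
                raise RuntimeError("simulated crash")
            return task(list(checkpoint)), attempts
        except RuntimeError:
            continue  # the backup takes over from the saved checkpoint


result, attempts = run_with_backup(buggy_average, [2, 4, 6], transient_faults=1)
print(result, attempts)  # the crash is survived, but the average of 2, 4, 6 is not 6.0
```

The retry loop handles the fault it was designed for and is blind to the one that matters here: both the primary and the backup run the same blueprint.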

So what would true software backup systems look like?

One could take the military approach (often called N-version programming), in which three teams independently implement the same functionality, and the final action is decided by a consensus of the three unrelated code bases.  This substantially increases the reliability of the code, but it also requires three times the effort.  A similar strategy is frequently employed by developers: test-driven development, in which the correctness of the executing code is verified by a separate body of test code.  But though this greatly helps code reliability, it isn't exactly fault tolerance.  Additionally, try/catch blocks can help identify, prevent, and sometimes fix problems, but they lack the easy hot-swap repair of well-designed fault-tolerant hardware.
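
The three-team consensus can be sketched in a few lines.  This is only an illustration under assumptions: the three `median_*` functions stand in for independently written code bases, one of them carries a deliberate off-by-one bug, and `vote` is a hypothetical helper rather than any standard API.

```python
from collections import Counter
import statistics


def median_a(xs):
    # Team A: sort and take the middle element (odd-length input assumed).
    return sorted(xs)[len(xs) // 2]


def median_b(xs):
    # Team B: lean on the standard library.
    return statistics.median(xs)


def median_c(xs):
    # Team C: deliberate off-by-one bug for the demonstration.
    return sorted(xs)[len(xs) // 2 - 1]


def vote(implementations, xs):
    """Return the answer that a strict majority of implementations agree on."""
    results = []
    for impl in implementations:
        try:
            results.append(impl(xs))
        except Exception:
            pass  # a crashed version simply loses its vote
    if not results:
        raise RuntimeError("every version failed")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(implementations):
        raise RuntimeError("no majority")
    return value


answer = vote([median_a, median_b, median_c], [7, 1, 5])
print(answer)  # 5: teams A and B outvote the buggy team C
```

The voter is cheap; the expense is in writing `median_a`, `median_b`, and `median_c` three separate times, and the scheme only helps when the teams do not make the same mistake.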

Can complete fault tolerance exist in software without a doubling of design effort?  It depends upon the kind of faults one is looking for.  Many distributed algorithms can prevent problems that arise from thread crashes or competition over resources.  To support complete software fault tolerance, however, a completely different approach to computing is required.  It is a field barely blossoming in 2009, but by the 2060s it becomes critical to many high-end, computationally complex systems: biocomputation.

The human brain is a type of computer with heavy levels of fault tolerance.  It frequently makes processing mistakes (think of stubbed toes, misspoken words, phantom cell phone rings), but generally recovers from these.  How does it do this?  To explore this question, today we interview Dr. Larry Gernsback, head of the Neurocomputing Design Division in Qx Machinations (founded by two former Intel engineers and an MIT neuroscientist in the 2040s).

Kurt: Dr. Gernsback, you're involved in a field that would be foreign to most of our readers in 2009.  How does a neurocomputer differ from a standard digital computer?
Gernsback: Well, neurocomputers and biocomputers operate quite differently from classical digital and quantum computers.  The most familiar technology might be something along the lines of a highly-networked analog control system.
Kurt: I'm not sure I understand.  Could you clarify?
Gernsback: In a traditional computer program, an operation is specified, it is compiled, and it runs exactly as it was told to, in any given environment.  A loop will behave the same no matter where in the code it is placed.  In a biocomputer, there are two primary differences.  First, no code can operate independently of its execution environment.  Every function must be fed extensive data about the state of the area of the system it will be placed in.  Second, the code isn't compiled into machine instructions, but rather into probabilistic control mechanisms.  In effect, the code becomes the summation of a set of highly conditional input-output matchings.  It's effectively like performing a Fourier transform on discrete data.  Rather than being a signal defined by a single function, it becomes defined by a large series of additive, simpler functions.  So when you write a method, it is compiled into a number of regulating methods, each listening to different system state information, and modifying different system state information.  The cumulative effect is the desired functionality.
Kurt: So what benefit does this have?
Gernsback: Well, the primary consequence is that everything in the system affects everything else.  Now, this might sound inherently unstable, but if designed properly, it is quite the opposite.  A function will behave predictably when placed into a predictable environment; when anomalies occur in the system, they modify the global variables that all sorts of functions are listening for.  These functions in turn regulate those variables, returning them to normalcy.  Everything acts as a controller for everything else, meaning that parts with only some functional overlap will actually mutually enhance each other's reliability.
Kurt: Most fascinating.  So besides reliability, what does one gain from such functionality?
Gernsback: Well, it fundamentally changes the way in which one designs the system.  The "program" itself will seek out order and self-organize around the data fed into it.  This makes some classes of data analysis, security, even user interaction entirely different.  Rather than programming the internals of a feature, you mathematically define the desired outputs, and then feed the system extensive sets of simulated data.  New regulating functions will emerge to adapt to this data, modifying the program in such a way that it produces the desired output.
Kurt: So, the software is evolved, almost literally?
Gernsback: Precisely.  And with that design paradigm, one gets all sorts of other gains.  Programs can evolve in multiple different directions in virtual environments, and even "mate" with each other.  It enables you to run the genetic algorithm on programs themselves, so you can rapidly produce a variety of possibilities and select the desired one.  It costs almost the same to produce 100 unique programs that meet your requirements as it does to produce 1, fundamentally changing the way one approaches the system.  A good developer knows how to guide the evolution of a product, more like a botanist than a traditional engineer.  We have a joke around the office that software that doesn't behave as desired only needs to be "domesticated".  I think it's an apt term, in some respects.
Kurt: I'll admit, a lot of this discussion went over my head, but the structure of it I find extremely fascinating.  And I'll have you know, Dr. Gernsback, that your work is only the beginning.
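
For readers back in 2009, the evolutionary loop Gernsback describes has a very ordinary ancestor.  What follows is a plain, present-day genetic algorithm over bit strings, not a biocomputer: the `TARGET` pattern stands in for the "mathematically defined desired output", and every name and parameter here is an assumption chosen for illustration.

```python
import random

# Hypothetical "desired output" that candidates are scored against.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]


def fitness(candidate):
    # Number of positions that match the desired output.
    return sum(1 for c, t in zip(candidate, TARGET) if c == t)


def crossover(a, b, rng):
    # "Mate" two candidates at a random cut point.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(candidate, rng, rate=0.05):
    # Flip each bit with a small probability.
    return [1 - bit if rng.random() < rate else bit for bit in candidate]


def evolve(generations=40, population_size=30, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in TARGET] for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: population_size // 2]
        children = [population[0]]  # elitism: the current best survives unchanged
        while len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    return max(population, key=fitness), history


best, history = evolve()
print(fitness(best), "of", len(TARGET), "positions match")
```

Nobody writes the winning bit string by hand; the developer only supplies the scoring rule and the selection pressure, which is about as close to botany as today's tools get.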

Okay.  At this point in time, we have C++ and some basic calculators implemented in DNA.  We don't have extensively self-modifying programs.  What we have is redundancy and transaction management.  This is powerful, and this will be the dominant paradigm for some time to come.  But true, complete, software-level fault tolerance is currently profoundly laborious to implement.  And until a program can dynamically modify itself, the most effective duplication is a duplication of effort.
