************************************************************
Infection of biological DNA with digital Computer Code
by Second Part To Hell
************************************************************
Biological life spreads in the biological-chemical world; computer code
spreads in the digital computer world. So far this has been a rule: no
self-replicator has ever overcome the digital-biological barrier. Until today.
Here I show a method by which digital computer code can infect biological
DNA and thus spread in the biological-chemical world. The method is mainly
based on the fantastic research by the J. Craig Venter Institute on
synthetic life, and might raise new questions about the definition of
life itself.
1) Introduction
2) Craig Venter's synthetic cell
2.1) General concept
2.2) Watermarks and DNA encoding language
3) FASTA file format
3.1) plain fasta-files
3.2) xml fasta-files
4) Infection scenario
4.1) Stealth technique in the DNA
5) Conclusion
1) Introduction
In 2010, the J. Craig Venter Institute (JCVI) reported the creation of a
bacterial cell with a chemically synthesized genome [1]. They sequenced
the DNA of a bacterium (M.mycoides), modified several parts of its DNA in
the computer, synthesized the novel genome and transplanted it into the
cell of a different bacterium (M.capricolum). They observed that the cell
was controlled only by the new DNA. For verification, they introduced
artificial "watermark" sequences (non-coding parts of the DNA) into the
genome, which contain among other things the names of the involved
scientists (written in a specially designed DNA encoding alphabet). The
artificially created genome was capable of continuous self-replication.
They call their new artificial bacterium Mycoplasma mycoides JCVI-syn1.0.
This is in my opinion one of the greatest scientific achievements in
recent years.
In this text I explain the implementation of a computer code that makes
the step from the digital to the biological world.
The computer code, written in C++, hosts the DNA sequence of M.mycoides
JCVI-syn1.0. At runtime it acts as follows:
1) It prepares the DNA sequence of M.mycoides JCVI-syn1.0 in memory
(with slightly modified watermarks).
2) It encodes its own file content in base32. The base32 text is then
encoded in JCVI's DNA encoding alphabet.
3) This representation of its digital form is then copied into a
watermark of the bacterium's genome in memory. With this, a fully
functional bacterial DNA sequence including the digital code is
generated.
4) Next it searches for FASTA files on the computer, which are text-based
representations of DNA sequences, commonly used by many DNA sequence
libraries.
5) For each FASTA file, it replaces the original DNA with the bacterial
DNA containing the digital form of the computer code.
The code has a classical self-replication mechanism as well, so that it
may eventually end up on a computer in a microbiology laboratory that is
able to create DNA from digital genomes (such as the laboratories of the
JCVI). If the scientists are incautious, the computer code's genome
(instead of the intended original DNA) might be written to the biological
cell. The new cell will start replicating in the biological world, and
with it the representation of the digital computer code.
2) Craig Venter's synthetic cell
2.1) General concept
The team of Craig Venter has demonstrated how to create bacteria
controlled by artificially designed and synthesized DNA. For that,
they used the sequenced DNA of a ~1 mega-base pair bacterium,
M.mycoides. They modified the genome on the computer - deactivated
several genes, and introduced watermarks (artificial non-coding
parts of the DNA). A company called Blue Heron synthesized 1000 bp
fragments of the full DNA. With a three-step procedure, they assembled
the full DNA. This was transplanted into an empty receiver cell of the
bacterium M.capricolum.
Amazingly, the cell with the new genome booted up, and was able to
self-replicate. To verify that the expected genome was replicating,
they gave the watermarks special features which can be detected
with chemical methods.
In their article [1] they write:
"This work provides a proof of principle for producing cells
based on computer-designed genome sequences. DNA sequencing
of a cellular genome allows storage of the genetic instructions
for life as a digital file."
The project described here uses the method of their proof of principle.
2.2) Watermarks and DNA encoding language
The watermarks are parts of the genome that are not translated into
functional proteins. That means: they are part of the DNA, but have
no functional effect on the behaviour of the cell.
The watermarks are written in the nucleotides A, C, G, T. JCVI
developed an encoding from DNA to human-readable characters: three
nucleotides (one codon) represent one letter or ASCII symbol. With
that encoding method, they wrote readable information into the
cell: it contains the names of the involved scientists, philosophical
quotes and one piece of HTML code with an e-mail address.
The mapping from codons to characters has never been documented
explicitly, but most of it can be deduced from the implicit information
given in the article. The known alphabet looks like this:
TAG = a    GCA = k    TCC = u    AGA = 4          CAC = /
AGT = b    AAC = l    TTG = v    GCG = 5          CCA = =
TTT = c    CAA = m    GTC = w    GCC = 6          CGA = .
ATT = d    TGC = n    GGT = x    TAT = 7          GAG = !
TAA = e    CGT = o    CAT = y    CGC = 8          CAG = :
GGC = f    ACA = p    TGG = z    GTA = 9          GGA = "
TAC = g    TTA = q    TCT = 0    ATA = space      GTG = ,
TCA = h    CTA = r    CTT = 1    GGG = chr(10)    TCG = @
CTG = i    GCT = s    ACT = 2    AGC = >          CCC = -
GTT = j    TGA = t    AAT = 3    CGG = <
Four watermarks have been introduced to the modified bacterial DNA
in the computer.
As an example, a part of the DNA sequence of one watermark is:
GCTTAATAAATATGATCACTGTGCTACGCTATATGCCGTTGAATATAGGCTATATGATC
ATAACATATATAGCTATAAGTGATAAGTTCCTGAATATAGGCTATATGATCATAACATA
TACAACTGTACTCATGAATAAGTTAACGA
The sequence is divided into three-nucleotide parts (codons):
GCT TAA TAA ATA TGA TCA CTG TGC TAC GCT ATA TGC CGT TGA ATA
TAG GCT ATA TGA TCA TAA CAT ATA TAG CTA TAA GTG ATA AGT TCC
TGA ATA TAG GCT ATA TGA TCA TAA CAT ATA CAA CTG TAC TCA TGA
ATA AGT TAA CGA
We can see in the above list that GCT stands for "s", TAA stands for
"e", ATA is a space, TGA stands for "t" ... and so on.
In the end we can extract the sentence:
"see things not as they are, but as they might be."
Obviously we can also write in this encoding technique:
"hello vxers!" ->
TCA TAA AAC AAC CGT ATA TTG GGT TAA CTA GCT GAG
The full alphabet is not known, therefore the meaning of only 49 out of
the 64 codons is presented here. However, all 64 codons are used in the
watermarks (i.e. there is no biological reason for not using specific
codons).
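The decoding walked through above can be reproduced with a few lines of
Python. This is only a minimal sketch: the table holds exactly the 49
codons listed in this text, the names CODON_TO_CHAR, decode_watermark and
encode_text are my own and not taken from JCVI, and unknown codons are
simply shown as "?".

```python
# Codon table exactly as listed in the text (49 of the 64 codons are known).
CODON_TO_CHAR = {
    "TAG": "a", "AGT": "b", "TTT": "c", "ATT": "d", "TAA": "e",
    "GGC": "f", "TAC": "g", "TCA": "h", "CTG": "i", "GTT": "j",
    "GCA": "k", "AAC": "l", "CAA": "m", "TGC": "n", "CGT": "o",
    "ACA": "p", "TTA": "q", "CTA": "r", "GCT": "s", "TGA": "t",
    "TCC": "u", "TTG": "v", "GTC": "w", "GGT": "x", "CAT": "y",
    "TGG": "z", "TCT": "0", "CTT": "1", "ACT": "2", "AAT": "3",
    "AGA": "4", "GCG": "5", "GCC": "6", "TAT": "7", "CGC": "8",
    "GTA": "9", "ATA": " ", "GGG": "\n", "AGC": ">", "CGG": "<",
    "CAC": "/", "CCA": "=", "CGA": ".", "GAG": "!", "CAG": ":",
    "GGA": '"', "GTG": ",", "TCG": "@", "CCC": "-",
}
CHAR_TO_CODON = {char: codon for codon, char in CODON_TO_CHAR.items()}


def decode_watermark(dna):
    """Translate a nucleotide string into text, three nucleotides at a time."""
    dna = "".join(dna.split()).upper()  # drop spaces and line breaks
    if len(dna) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    # Codons outside the known table are rendered as "?".
    return "".join(CODON_TO_CHAR.get(codon, "?") for codon in codons)


def encode_text(text):
    """Translate text into space-separated codons (known characters only)."""
    return " ".join(CHAR_TO_CODON[char] for char in text.lower())


# The watermark fragment quoted in the text.
fragment = (
    "GCTTAATAAATATGATCACTGTGCTACGCTATATGCCGTTGAATATAGGCTATATGATC"
    "ATAACATATATAGCTATAAGTGATAAGTTCCTGAATATAGGCTATATGATCATAACATA"
    "TACAACTGTACTCATGAATAAGTTAACGA"
)
print(decode_watermark(fragment))
print(encode_text("hello vxers!"))
```

A dictionary in both directions is enough here, because every known codon
maps to exactly one character and vice versa.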
3) FASTA file format
FASTA files are text-based representations of nucleotide sequences, commonly
used in microbiology libraries. There are two FASTA file types that I
will describe here. The first one is the plain FASTA format (such files
usually have the file extension .fasta or .fas). The second one wraps the
same sequence in a small XML file.
Both are available from the genome database
http://www.ncbi.nlm.nih.gov/.
For example, if you want to see the DNA of Mycoplasma mycoides JCVI-syn1.0:
http://www.ncbi.nlm.nih.gov/nuccore/296455217
or something more common: E.coli
http://www.ncbi.nlm.nih.gov/nuccore/BA000007.2
3.1) plain fasta-files
The plain fasta-files have a small header, followed by a plain
representation of the DNA in the nucleotide basis (A, T, G, C).
Two examples:
a) Mycoplasma mycoides JCVI-syn1.0
This is about 1MB of data
- - - - - - - - Mycoplasma mycoides JCVI-syn1.0.fasta - - - - - - - -
>gi|296455217|gb|CP002027.1| Synthetic Mycoplasma mycoides JCVI-syn1.0 clone sMmYCp235-1, complete sequence
ATGAACGTAAACGATATTTTAAAAGAACTTAAACTAAGTTTAATGGCTAATAAAAATATTGATGAATCCG
TGTATAACGACTATATAAAGACAATAAATATTCATAAAAAGGGGTTTTCTGATTATATTGTTGTTGTTAA
ATCACAATTTGGTTTGTTAGCTATAAAACAGTTTCGTCAAACTATTGAAAATGAGATAAAAAATATTTTA
AAAGAACCTGTAAATATTAGTTTTACATACGAACAAGAATATAAAAAACAACTAGAAAAAGATGAATTAA
TTAATAAAGATCATTCTGATATCATTACTAAAAAAGTTAAAAAAACTAATGAAAACACTTTTGAAAATTT
...
- - - - - - - - Mycoplasma mycoides JCVI-syn1.0.fasta - - - - - - - -
b) Escherichia coli
This is about 5.5MB of data
- - - - - - - - - - - - - - - E.coli.fasta - - - - - - - - - - - - - - -
>gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTCTGACAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCACCACCATCACCATTACCATTACCACAGGTAACGGTGCGGGCTGACGCGTAC
AGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTCGACCAAAGGTAACGAGGTAACA
...
- - - - - - - - - - - - - - - E.coli.fasta - - - - - - - - - - - - - - -
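Reading such a plain FASTA file takes only a few lines. The following is a
minimal sketch in Python, assuming the header is the line starting with
">" and everything up to the next ">" is sequence data; the function name
parse_fasta is my own, and the example text is just the first two sequence
lines of the listing above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from plain FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|296455217|gb|CP002027.1| Synthetic Mycoplasma mycoides JCVI-syn1.0 clone sMmYCp235-1, complete sequence
ATGAACGTAAACGATATTTTAAAAGAACTTAAACTAAGTTTAATGGCTAATAAAAATATTGATGAATCCG
TGTATAACGACTATATAAAGACAATAAATATTCATAAAAAGGGGTTTTCTGATTATATTGTTGTTGTTAA
"""

records = parse_fasta(example)
header, sequence = records[0]
fields = header.split("|")  # gi number and accession sit between the bars
print(fields[1], fields[3], len(sequence))
```

The header fields are separated by "|", so the GI number and the accession
can be read directly from the split header.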
3.2) xml fasta-files
The second form is pure DNA as well, but wrapped in a small XML file. Two
examples again:
- - - - - - - - Mycoplasma mycoides JCVI-syn1.0.fasta.xml - - - - - - - -
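<TSeq>
<TSeq_gi>296455217</TSeq_gi>
<TSeq_accver>CP002027.1</TSeq_accver>
<TSeq_taxid>766747</TSeq_taxid>
<TSeq_orgname>synthetic Mycoplasma mycoides JCVI-syn1.0</TSeq_orgname>
<TSeq_defline>Synthetic Mycoplasma mycoides JCVI-syn1.0 clone sMmYCp235-1, complete sequence</TSeq_defline>
<TSeq_length>1078809</TSeq_length>
<TSeq_sequence>ATGAACGTAAACGATATTTTAAAAGAACTTAAACTAAGTTTAATGGCTAATAAAAATATTGATGAATCCGTGTATAACGACTATATAAAGACAATAAATATTCATAAAAAGGGGTTTTCTGATTATATTGTTGTTGTTAAATCA...</TSeq_sequence>
</TSeq>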
- - - - - - - - Mycoplasma mycoides JCVI-syn1.0.fasta.xml - - - - - - - -
or E.coli again:
- - - - - - - - - - - - - - - E.coli.fasta.xml - - - - - - - - - - - - - - -
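<TSeq>
<TSeq_gi>47118301</TSeq_gi>
<TSeq_accver>BA000007.2</TSeq_accver>
<TSeq_taxid>386585</TSeq_taxid>
<TSeq_orgname>Escherichia coli O157:H7 str. Sakai</TSeq_orgname>
<TSeq_defline>Escherichia coli O157:H7 str. Sakai DNA, complete genome</TSeq_defline>
<TSeq_length>5498450</TSeq_length>
<TSeq_sequence>AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTCTGACAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATA...</TSeq_sequence>
</TSeq>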
- - - - - - - - - - - - - - - E.coli.fasta.xml - - - - - - - - - - - - - - -
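The XML variant can be read with any XML parser. The sketch below uses
Python's xml.etree.ElementTree and rests on an assumption: that the
fields are stored in NCBI's TinySeq elements (TSeq_gi, TSeq_accver,
TSeq_taxid, TSeq_orgname, TSeq_defline, TSeq_length, TSeq_sequence). The
function names read_tinyseq and length_matches are my own, and the
sequence in the example is shortened by hand.

```python
import xml.etree.ElementTree as ET

# Shortened example record; the element names are an assumption (TinySeq).
xml_text = """<TSeq>
  <TSeq_gi>296455217</TSeq_gi>
  <TSeq_accver>CP002027.1</TSeq_accver>
  <TSeq_taxid>766747</TSeq_taxid>
  <TSeq_orgname>synthetic Mycoplasma mycoides JCVI-syn1.0</TSeq_orgname>
  <TSeq_defline>Synthetic Mycoplasma mycoides JCVI-syn1.0 clone sMmYCp235-1, complete sequence</TSeq_defline>
  <TSeq_length>1078809</TSeq_length>
  <TSeq_sequence>ATGAACGTAAACGATATTTTAAAAGAAC</TSeq_sequence>
</TSeq>"""


def read_tinyseq(text):
    """Return the fields of the first TSeq record as a dictionary."""
    root = ET.fromstring(text)
    # The record may be the root itself or sit inside a surrounding set.
    record = root if root.tag == "TSeq" else root.find("TSeq")
    if record is None:
        raise ValueError("no TSeq record found")
    prefix = "TSeq_"
    fields = {}
    for child in record:
        name = child.tag[len(prefix):] if child.tag.startswith(prefix) else child.tag
        fields[name] = (child.text or "").strip()
    return fields


def length_matches(fields):
    """Check whether the declared length equals the actual sequence length."""
    return int(fields["length"]) == len(fields["sequence"])


record = read_tinyseq(xml_text)
print(record["accver"], record["length"], length_matches(record))
```

Because the file carries an explicit length field, a reader can compare it
with the real length of the sequence; for the shortened example above the
two values of course do not match.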
4) Infection scenario
The strategy of this digitally and biologically self-replicating code is
the following:
It starts as a digital computer file, and replicates itself via local
networks, USB sticks and other removable devices.
There are two potential scenarios to step from the digital to the
biological world:
1) The self-replicating code might end up on the USB stick of a
microbiologist. (S)he runs it unintentionally on a computer that hosts
DNA sequences (stored in the common FASTA file format) which will be
synthesized and transplanted into receiving cells (such as in the labs of
JCVI). The computer code will find these FASTA files and replace their
DNA sequences with the bacterial genome of M.mycoides. This genome
contains a watermark with the DNA representation of the file content of
the computer code. When the DNA files are synthesized, the computer code
is synthesized as well, and will continuously self-replicate in the
biological world in the form of a bacterium.
2) In this scenario, the code gets onto the computer of a genome library
(such as NCBI, the National Center for Biotechnology Information).
The computer code will search for FASTA files and replace their DNA
content with its own DNA code. An employee will unintentionally upload
the computer code's DNA instead of the original DNA.
Then - back in a laboratory like that of JCVI - scientists will
download the modified DNA sequence. When they synthesize the wrong DNA
sequence, the computer code lands in the bacterial cell again, again
capable of continuously self-replicating in the biological world.
There is another interesting scenario. Mycoplasma mycoides bacteria
usually infect cattle and goats. Imagine an unexplained outbreak of the
bacterium presented here. Goats or cattle would get sick, and
microbiologists would want to know the exact reason. They take samples of
the infectious cells and sequence them in their laboratories.
Now they see the DNA, and find out that the bacterium contains a rather
big non-coding sequence - the watermark. They find this very unnatural and
analyse the watermark, also by applying Craig Venter's DNA encoding
alphabet (because it is very famous due to JCVI's first fascinating
results). After decoding, they see that the text only contains
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,2,3,4,5,6,7
This is a curious structure; they research a bit and see that it is
base32 encoding. They decode it, and see 'M','Z',0x90,0x0,...
They immediately see that it is a Windows executable, and I guess would be
surprised :)
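The analysts' last two steps (recognising the base32 alphabet and
spotting the 'M','Z' start) can be sketched in Python with the standard
base64 module. The function names are my own, and the sample is not taken
from any real genome: it is just the five bytes 'M','Z',0x90,0x0,0x3
encoded on the spot, which is an assumption for illustration only.

```python
import base64
import string

# The 32 symbols the analysts would notice: a-z and 2-7.
BASE32_CHARS = set(string.ascii_lowercase + "234567")


def looks_like_base32(text):
    """True if the text is non-empty and uses only the base32 alphabet."""
    return bool(text) and set(text) <= BASE32_CHARS


def decode_base32(text):
    """Decode lowercase, unpadded base32 text into bytes."""
    # b32decode expects uppercase input padded with "=" to a multiple of 8.
    padded = text.upper() + "=" * (-len(text) % 8)
    return base64.b32decode(padded)


# Illustrative sample only: the first bytes of a typical DOS/PE header.
sample = base64.b32encode(b"MZ\x90\x00\x03").decode("ascii").lower().rstrip("=")

if looks_like_base32(sample):
    data = decode_base32(sample)
    # Windows executables start with the two bytes 'M','Z'.
    print(data[:2] == b"MZ")
```

The padding step is needed because the "=" fill characters of base32 are
not part of the a-z,2-7 alphabet listed above.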
4.1) Stealth technique in the DNA
In their modified genome of M.mycoides JCVI-syn1.0, the JCVI team
introduced four watermarks. Every watermark contains a special
sequence which is used to test whether a cell has the intended
genome or is a contamination (for example, from the receiving cell).
In the supplementary material of their article [1], they describe the
exact representation of these sequences (primers). Each of the four
watermarks contains one primer. When they perform a multiplex PCR, each
watermark creates a specific characteristic.
In my code, I removed the entire original content of all watermarks,
except for the identified primer sequences. As a result, when a team
tests the bacterial cell carrying the representation of the digital code,
it will show the same characteristic as their originally designed DNA.
Thus the computer code's DNA will pass this test.
5) Conclusion
I've shown the implementation of a technique that allows a digital computer
code to make the step to the biological world. This is done by infecting a
DNA file with the genome of a self-replicating bacterium. The
bacterium's genome contains the digital code of the self-replicator in the
form of a base32 representation encoded via Craig Venter's DNA encoding
alphabet.
The bacterium will self-replicate in the biological world, and
so will the representation of the digital computer code.
The outbreak probability of such cross-domain infectors is very low. The
researchers in [1] have carried out ethical studies, and I'm convinced that
they came up with perfect protections against potential attacks such as
this.
Finally, digital self-replicators are usually not considered a form of
life, even though they fulfill the most important characteristics of life:
the capability of self-replication and being subject to evolution [2].
I wonder whether this computer code can count as a form of life - if so, I
would call it
Mycoplasma mycoides SPTH-syn1.0
:)
Second Part To Hell
October 2013
http://spth.virii.lu/
sperl.thomas@gmail.com
twitter: @SPTHvx
[1] Daniel G. Gibson et al., "Creation of a Bacterial Cell Controlled by a
Chemically Synthesized Genome", Science 329, 52 (2010).
[2] SPTH, "Taking the redpill: Artificial Evolution in native x86 systems",
http://vxheaven.org/lib/vsp26.html, (2010).
SPTH, "Imitation of Life: Advanced system for native Artificial Evolution",
in valhalla#1, http://vxheaven.org/lib/vsp37.html, (2011).
PS: Thanks to hh86 for motivation. Thanks to the JCVI team for their awesome
research, looking forward to reading more discoveries on the border between
dead and living material!
PPS: I'm not a microbiologist (or biologist at all). Even though I tried as
hard as possible, I cannot rule out that some assumptions might be wrong and
that I might have misunderstood some things.
In any case, the main idea should be valid.