************************************************************ Infection of biological DNA with digital Computer Code by Second Part To Hell ************************************************************ Biological life spreads in the biological-chemical world, computer codes can spread in the digital computer world. That is a rule - no self- replicator has ever overcome the digital-biological barrier. Until today. Here I show a method how a digital computer code can infect biological DNA, thus spread in the biological-chemical world. The method is mainly based on the fantastic research by the J. Craig Venter Institute on synthetic life, and might ask new questions about the definition of life itself. 1) Introduction 2) Craig Venter's synthetical cell 2.1) General concept 2.2) Watermarks and DNA encoding language 3) FASTA file format 3.1) plain fasta-files 3.2) xml fasta-files 4) Infection scenario 4.1) Stealth technique in DNA 5) Conclusion 1) Introduction In 2010, the J. Craig Venter Institute (JCVI) reported the creation of a bacterial cell with a chemically synthesized genome [1]. They sequenced the DNA of a bacteria (M.mycoides), modified several parts of its DNA in the computer, synthetized the novel genome and transplanted it to a different bacteria's cell (M.capricolum). They observed the control of the cell only by the new DNA. For verification, they introduced artificial "watermarks" sequences (non-coded part of the DNA) to the genome, which contained among other things the names of the involved scientists (written in a specially designed DNA encoding alphabet). The artificially created genome was capable of continuous self-replication. They call their new artificial bacterial Mycoplasma mycoides JCVI-syn1.0. This is in my opinion one of the greatest scientific achievement in recent years. In this text I explain the implementation of a computer code that makes the step from the digital to the biological world. The computer code, written in C++, hosts the DNA sequence of M.mycoides JCVI-syn1.0. At runtime it acts as follows: 1) Preparing the DNA sequence of M.mycoides JCVI-syn1.0 in the memory, (with slightly modified watermarks). 2) Encoding own file-content in base32. The base32 code is then encoded in JCVI's DNA-encoded alphabet. 3) This representation of its digital form is then copied to a watermark of the bacteria's genome in memory. With this, a fully functional bacterial DNA sequence including the digital code is generated. 4) Next it searches for FASTA-files on the computer, which are text-based representations of DNA sequences, commonly used by many DNA sequence libraries. 5) For each FASTA-file, it replaces the original DNA with the bacterial DNA containing the digital form of the computer code. The code has a classical self-replication mechanism as well, to eventually end up on a computer in a microbiology-laboratory with the ability of creating DNA out of digital genomes (such as laboratories by the JCVI). If the scientists are incautious, the computer code's genome (instead of the intented original DNA) might be written to the biological cell. The new cell will start replicating in the biological world, and with it the representation of the digital computer code. 2) Craig Venter's synthetical cell 2.1) General concept The team of Craig Venter has demonstrated how to create bacteria controlled by artificially designed and synthesized DNA. For that, they used the sequenced DNA of a ~1 mega-base pair bacteria M.mycoides. They modified the genome on the computer - deactivated several genes, and introduced watermarks (artificial non-coding parts of the DNA). A company called Blue Heron sequenced 1000 bp fragments of the full DNA. With a three-step procedere, they assembled the full DNA. This was transplanted into an empty receiver cell of the bacteria M.capricolum. Amazingly, the cell with the new genom booted up, and was able to self-replicate. To verify that the expected genome was replicating, they introduced special functionality to the watermarks which are visible with chemical methods. In their article [1] they write: "This work provides a proof of principle for producing cells based on computer-designed genome sequences. DNA sequencing of a cellular genome allows storage of the genetic instructions for life as a digital file." The project describe here uses the method of their proof-of-principle. 2.2) Watermarks and DNA encoding language The watermarks are parts of the genome that are not translated into functional proteins. That means: They are part of the DNA, but have no functional effect on the behaviour of the cell. The watermarks are represented by nucleotides A,C,G,T. JCVI developed an encoding technique from DNA to human letters. Three nucleotides (one codon) represent one letter or ascii symbol. With that encoding methode, they encode readable information into the cell: It contains the name of the involved scientists, philosophical quotes and one html-code with an e-mail adresse. The encoding from codons to letters has never been documented explicitly, but can be deduced mainly from the implicit information given in the article. The known alphabet looks like this: TAG = a GCA = k TCC = u AGA = 4 CAC = / AGT = b AAC = l TTG = v GCG = 5 CCA = = TTT = c CAA = m GTC = w GCC = 6 CGA = . ATT = d TGC = n GGT = x TAT = 7 GAG = ! TAA = e CGT = o CAT = y CGC = 8 CAG = : GGC = f ACA = p TGG = z GTA = 9 GGA = " TAC = g TTA = q TCT = 0 ATA = space GTG = , TCA = h CTA = r CTT = 1 GGG = chr(10) TCG = @ CTG = i GCT = s ACT = 2 AGC = > CCC = - GTT = j TGA = t AAT = 3 CGG = < Four watermarks have been introduced to the modified bacterial DNA in the computer. As an example, a part of the DNA sequence of one watermark is: GCTTAATAAATATGATCACTGTGCTACGCTATATGCCGTTGAATATAGGCTATATGATC ATAACATATATAGCTATAAGTGATAAGTTCCTGAATATAGGCTATATGATCATAACATA TACAACTGTACTCATGAATAAGTTAACGA The sequence is divided into three-nucleotide parts (codons): GCT TAA TAA ATA TGA TCA CTG TGC TAC GCT ATA TGC CGT TGA ATA TAG GCT ATA TGA TCA TAA CAT ATA TAG CTA TAA GTG ATA AGT TCC TGA ATA TAG GCT ATA TGA TCA TAA CAT ATA CAA CTG TAC TCA TGA ATA AGT TAA CGA We can see in the above list that GCT stands for "s", TAA stands for "e", ATA is a space, TGA stands for "t" ... and so on. In the end we can extract the sentence: "see things not as they are, but as they might be." Obviously we can also write in this encoding technique: "hello vxers!" -> TCA TAA AAC AAC CGT ATA TTG GGT TAA CTA GCT GAG The full structure of the alphabet is not known ,therefor only 49 out of 64 codon's representation are presented here. However all of them are used in the watermark (i.e. there is no biological reason for not using specific codons). 3) FASTA file format Fasta files are textbased representations of nucleotide sequences, commonly used in micro-biologic libraries. There are two fasta-file types that I will describe here. The first one is plain fasta-format (which usually have the file-extention .fasta or .fas. Both are available from the genome-database http://www.ncbi.nlm.nih.gov/. For example, if you want to see the DNA of Mycoplasma mycoides JCVI-syn1.0: http://www.ncbi.nlm.nih.gov/nuccore/296455217 or something more common: E.coli http://www.ncbi.nlm.nih.gov/nuccore/BA000007.2 3.1) plain fasta-files The plain fasta-files have a small header, followed by a plain representation of the DNA in the nucleotide basis (A, T, G, C). Two examples: a) Mycoplasma mycoides JCVI-syn1.0 This is about 1MB of data - - - - - - - - Mycoplasma mycoides JCVI-syn1.0.fasta - - - - - - - - >gi|296455217|gb|CP002027.1| Synthetic Mycoplasma mycoides JCVI-syn1.0 clone sMmYCp235-1, complete sequence ATGAACGTAAACGATATTTTAAAAGAACTTAAACTAAGTTTAATGGCTAATAAAAATATTGATGAATCCG TGTATAACGACTATATAAAGACAATAAATATTCATAAAAAGGGGTTTTCTGATTATATTGTTGTTGTTAA ATCACAATTTGGTTTGTTAGCTATAAAACAGTTTCGTCAAACTATTGAAAATGAGATAAAAAATATTTTA AAAGAACCTGTAAATATTAGTTTTACATACGAACAAGAATATAAAAAACAACTAGAAAAAGATGAATTAA TTAATAAAGATCATTCTGATATCATTACTAAAAAAGTTAAAAAAACTAATGAAAACACTTTTGAAAATTT ... - - - - - - - - Mycoplasma mycoides JCVI-syn1.0.fasta - - - - - - - - b) Escherichia coli This is about 5.5MB of data - - - - - - - - - - - - - - - E.coli.fasta - - - - - - - - - - - - - - - >gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTCTGACAGCAGC TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC ATTACCACCACCATCACCACCACCATCACCATTACCATTACCACAGGTAACGGTGCGGGCTGACGCGTAC AGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTCGACCAAAGGTAACGAGGTAACA ... - - - - - - - - - - - - - - - E.coli.fasta - - - - - - - - - - - - - - - 3.2) xml fasta-files The second form is pure DNA aswell, however in a small xml-file. Two examples again: - - - - - - - - Mycoplasma mycoides JCVI-syn1.0.fasta.xml - - - - - - - - 296455217 CP002027.1 766747 synthetic Mycoplasma mycoides JCVI-syn1.0 Synthetic Mycoplasma mycoides JCVI-syn1.0 clone sMmYCp235-1, complete sequence 1078809 ATGAACGTAAACGATATTTTAAAAGAACTTAAACTAAGTTTAATGGCTAATAAAAATATTGATGAATCCGTGTATAACGACTATATAAAGACAATAAATATTCATAAAAAGGGGTTTTCTGATTATATTGTTGTTGTTAAATCA... - - - - - - - - Mycoplasma mycoides JCVI-syn1.0.fasta - - - - - - - - or E.coli again: - - - - - - - - - - - - - - - E.coli.fasta.xml - - - - - - - - - - - - - - - 47118301 BA000007.2 386585 Escherichia coli O157:H7 str. Sakai Escherichia coli O157:H7 str. Sakai DNA, complete genome 5498450 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTCTGACAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATA... - - - - - - - - - - - - - - - E.coli.fasta.xml - - - - - - - - - - - - - - - 4) Infection scenario The strategy of this digitally and biologically self-replicating code is the following: It starts as a digital computer file, and replicates itself via local networks, USB sticks and other removeable devices. There are two potential scenarios to step from the digital to the biological world: 1) The self-replicating code might end up at a USB stick of a microbiologist. (S)he runs it unintentionally at a computer that host DNA sequences (stored in the common fasta-file format) which will be synthesized and transplanted to receiving cells (such as in the labs of JCVI). The computer code will find these fasta-files and replace their DNA sequences with the bacterial genome of M.mycoides. This genome contains a watermark with the DNA-representation of the file-content of the computer code. When the DNA files are synthesized, the computer code is synthesized aswell, and will continuously self-replicate in the biological world in the form of a bacteria. 2) In this scenario, the code gets to the computer of a genome library (such as NCBI, National Center for Biotechnology Information). The computer code will search for FASTA files and replace their DNA content with its own DNA code. The employee will unintentionally upload the computer code's DNA instead of the original DNA. Then - back in a laboratory like that from JCVI - scientists will download the modified DNA sequence. When they synthesize the wrong DNA sequence, the computer code lands in the bacterial cell again, again capable of continuously self-replicate in the biological world. There is a different interesting scenario: First, Mycoplasma mycoides bacteria are usually infecting cattles and goats. Imagine an unknown outbreak of the here presented bacteria. Goats or cattles would get sick, and microbiologists want to know the exact reason. They take samples of the infectious cells and sequence them in their laboratories. Now they see the DNA, and find out that the bacteria contains a rather big non-coded sequence - the watermark. They find this very unnatural and analyse the watermark, also by applying Craig Venter's DNA encoding alphabet (because it is very famous due to their first fascinating results). After decoding, they see that the code only contains a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,2,3,4,5,6,7 This is a curious structure, they research a bit and see that its base32- encoding. They decode it, and see 'M','Z',0x90,0x0,... They immediatly see that its a windows executeable, and I guess would be surprised :) 4.1) Stealth technique in the DNA In their modified genome of M.mycoides JCVI-syn1.0, the JCVI-team introduces four watermarks. Every watermark contains a special sequence which is useed to test whether a cell has the intended genome or is a contamination (for example, from the receiving cell). In the supplementary material of their article [1], they describe the exact representation of these sequences (primer). Each of the four watermarks contain one primer. When they perform a multiplex PCR, each watermark creats a specific characteristic. In my code, I removed the total original content of all watermarks, except for the identified primer-sequence. As a result, when a team tests the bacteria cell with the representation of the digital code, it will have the same characteristic as their original designed DNA. Thus the computer code's DNA will pass this test. 5) Conclusion I've shown the implementation of a technique that allows a digital computer code to make the step to the biological world. This is done by infecting a DNA-file with the genome of a self-replicating biological bacteria. The bacteria's genome contains the digital code of the self-replicator in form of a base32-representation encoded via Craig Venter's DNA encoding alphabet. The biological bacteria will self-replicator in the biological world, and so will the representation of the digital computer code. The outbreak-probability of such cross-domain infectors is very low. The researchers in [1] have made ethical studies, and I'm convinced that they came up with perfect protections against potential attacks as this. Finally, digital self-replicators are usually not considered as a form of life, even they fulfill the most important characteristic of life: capability of self-replication and subject to evolution [2]. I wonder whether this computer code can count as a form of life - if so, I would call it Mycoplasma mycoides SPTH-syn1.0 :) Second Part To Hell October 2013 http://spth.virii.lu/ sperl.thomas@gmail.com twitter: @SPTHvx [1] Daniel G. Gibson et al., "Creation of a Bacterial Cell Controlled by a Chemically Synthesized Genome", Science 329, 52 (2010). [2] SPTH, "Taking the redpill: Artificial Evolution in native x86 systems", http://vxheaven.org/lib/vsp26.html, (2010). SPTH, "Imitation of Life: Advanced system for native Artificial Evolution", in valhalla#1, http://vxheaven.org/lib/vsp37.html, (2011). PS: Thanks to hh86 for motivation. Thanks to the JCVI-team for their awesome research, looking forward reading more discoveries on the boarder between dead and living material! PPS: I'm not a microbiologist (or biologist at all). Even if I tried as hard as possible, I can not rule out that some assumptions might be wrong, some things I might have misunderstand. In any case, the main idea should be valid.