Detecting oh, roughly every polymorphic engine out there. By Rhincewind [Vlad] This article assumes that the reader has basic knowledge of polymorphic engines, mathematical reasoning and logics. I suggest to those who lack the first to read Dark Angel's fine two-part 'polymorphism primer', published in issues 11 and 12 of the Phalcon/Skism publication 40hex. ---- Definitions and terminology, and some general bits. The ideal polymorphic engine is popularly defined as a program that can transform a given block of code into another block of code that will have the exact same functionality and that any number of transformed blocks, all different instances of the same original code, share no predictable patterns as the presence of such patterns can be used to deduce the presence of encrypted code the engine is meant to hide. The most commonly used method by far to approximate this ideal is to apply an encryption loop on the given code, and prepend a program to the encrypted block of code that will decrypt it, and then pass on execution as if the program hadn't been encrypted at all. To get anywhere close to the ideal defined above, the prepended program most of course also vary, and this is where the real benefit of this approach comes in. Instead of having to vary a large block of code, only a decryption loop needs to be varied, typically having a size of just a little over 10 bytes. The advantages should be obvious. The disadvantages are a little less obvious and they are the main topic of this article. The decryption algorithm is kept as simple as possible as the complexity of the decryptor has a direct relation to the complexity of the code needed to succesfully vary it. Result? Encryption schemes even Caesar would've been ashamed of using: usually one round of operation, transforming each data unit U to it's crypttext equivalent U'. Yes transforming, it really isn't that much like encryption at all but alas, terminology is easier when this is regarded as encryption. As said, unit per unit, typically byte- or word-length, code is transformed. The detransformation procedure merely performs the reciprocal operation, et voila. Typical transformational operations are ADD, SUB, XOR and bit rotations. All these are binary operators, and, viewed as encryption can be said to use a key K. U' = U + K, or U' = U - K or U' = U XOR K. ---- How to detect? And a small bit o' maths. An engine would be undetectable if 1) the decryptor is indistinguishable from programs without a decrypting function and 2) the encrypted code is indistinguishable from random data. Of course it would help if the decryptor were like random data but alas, the restrictions imposed on programs by the processor's instruction set do not permit this. Decryptor attacks come in a few flavors but are basically all the same. They see if a fragment of code could have been generated by the engine the attack is looking for. This is obviously a flawed attack as parts of a legitimate program may very well have been generated by an engine. Hiding polymorphically generated code in the middle of a legitimate program makes flawless detection of this type virtually impossible, see Commander Bomber. The most foolish of decryptor attacks traces the decrypting code, looking for processor opcodes the engine searched for can't generate. Prepending one or two instructions foreign to the engine will throw off this attack. Quite a few of the lesser anti-virus programs can be fooled by this. A more sophisticated, but related attack involves statistical analysis. Perhaps it's worth a description in another issue. Votes please! No, the real attacks of the moment are different. They focus on the encrypted code. There may not be a predictable unit sequence, but the units have distinct mathematical relations to eachother as all were transformed using the same operation. Exploiting this is not hard at all if the reciprocal operator is known. It is possible to transform corresponding parts of crypt- and plaintext independently into a common form. To use this for detection you assume a string of units to be an encrypted version of a string of plaintext, compute what should be their common form, once from the plaintext, once from the crypttext, and see if it's a match. The manipulations needed to get to this common form are different for each encryption type, but are for single loops easily obtained as follows: A' = A OP K and B' = B OP K where A and B are plaintext (unencrypted) data units, and A' and B' form their crypttext. Now we can derive: A' NegOP B' == (A OP K) NegOP (B OP K) == A OP K NegOP B NegOP K == A NegOP B. Yes, A' NegOP B' == A NegOP B! To illustrate this: A' XOR B' = A XOR B if the code was encrypted with a XOR loop. You XOR consecutive unitpairs of crypttext, and of your plaintext signature and check the results. Pairs meaning D[n] and D[n+1], followed by D[n+1] and D[n+2], with D as a unit-length array of data. Of course this also works for a bit more complex loops, such as an addition loop where K is increased by one every iteration, and one unit is transformed per iteration. A' = A + K B' = B + K + 1 A' - B' == (A + K) - (B + K + 1) == A + K - B - K - 1 == A - B - 1. A' - B' == A - B - 1 Now you compute A' - B' and A - B - 1 and compare the results. As you can see, the operations needed to transform crypt- and plaintext to a common form need not be equal, as long as both results are. This attack is called cryptanalysis; you check if a string of units could be an encrypted version of a plaintext string. False positive error chance drops rapidly as signaturelength increases following 1/2^(unit_bitlen*unit_siglen). Now try another one. A XOR loop where the loop counter (=units left to transform) is added to the key K each iteration. A' = A XOR K B' = A XOR (K + C) A' XOR B' == (A XOR K) XOR (A XOR (K+C)) == ??? We can't eliminate K! If you can't eliminate K, then a common form cannot be computed from A and B. The solution is to recover K using a cryptological known plaintext attack. This kind of attack happens to be ridiculously easy on 'ciphers' (sic) like these. K = A' XOR A K + C = B' XOR B Now we have K. This is need not be the K of the first byte encrypted, just start somewhere in the middle of call it THE K, like the pioneers who coincidentally found some land and decided to call it theirs. If it were an easier loop we could take K and decrypt all of the code, but alas, now we must first recover the counter C. K + C - K = C == B' XOR B - A' XOR A = C Now we can decrypt the Nth unit, A being unit 0, by U=U' XOR (K+N*C) and comparing U against the unit in our plaintext signature. This attack is usually also referred to as being cryptanalytic, but technically it falls in the domain of cryptology as you actually decrypt what is encrypted. For this type it is vital that you operate on pairs of consecutive units only. With most cryptanalytic attacks, you may just as well muddle about. ---- The limit of these attacks. Nearly all polymorphic engines to date can be caught using these simple, what I like to call 'crypto-sweeps'. Things get a bit more complex when multiple layer encryption is used. Let's take an engine that encrypts each unit using an ADD and a XOR: A' = (A+K1) + K2 B' = (B+K1) + K2 Looks easy: A' XOR B' == (A+K1) XOR (B+K1) But now you need to recover or eliminate K1 which is not at all easy with the XOR in the middle. With even more layers, it becomes practically impossible to decrypt and the weight is pushed back to the complexity of the decryptor. All in all I'd say that what we've seen so far is just the start of polymorphism. Territory to be conquered, things unthought of to be thought of, and so on. Rhince.