Hey guys,
I recently asked Hassan if I could write a piece on computer viruses and what makes them tick. He told me I could, so here's the first part in a series about computer viruses, their many forms and the most difficult and complex code (in my opinion) you're likely to find short of your dreams.
Warning:
I need to mention a few things before venturing onwards:
Nowhere in this series I'm going to hand you dangerous or harmful code that you can just copy and paste and fry someone's computer. That doesn't mean that everything in this series is going to be 100% harmless. I urge you not to put what you learn here to malicious use. Finally notice that there are anti-virus programs that cost as much as $80; there's a reason for that...
I'm not (yet) an authority on the subject, I've spent a great deal of time creating my own viruses but despite my best effort the best I managed yet was only a polymorphic virus with moderate mutation rates (more on this later). So although I won't be able to give a working example of more complex viruses, I can give and explain the theory to you.
A primer
Since this is just the introduction to the fascinating land of computer viruses, let's start with the bare basics. I'm going to tell you about the different ranks of viruses and I hope this will open your eyes to the dangers of downloading files on the internet (even though you've probably heard people say this a 1000 times to you already).
So, the most (basic??) basic computer viruses are relatively(!) simple programs they consist of only two parts. The replication engine (replicator) and the payload. The replication engine contains code to infect other files and the payload is everything that is not related to replication. For example: there are viruses that would spread up until a certain date and would then proceed to show a message box. In this case the payload is the showing of a message box. A basic virus will make no attempts to hide itself or apply stealth, it'll just go and infect every file that's within the scope of it's replication engine (every file in it's current directory, every file in public folders, every file, etc.)
This kind of virus is relatively easy to deal with for antivirus companies. They have to find a string of bytes that remains the same each generation of the virus and is unique to that virus. Since the virus does not change shape or size, any (long) piece of code from the virus will do.
The next level computer virus is the encrypting virus (with variable keys). They still only consist of two parts, but the replicator is more advanced than that of the basic virus. It has been upgraded with cryptographic functions, so the virus can encrypt and decrypt its code to change its signature. Now each generation of the virus will look different from the previous. The key used to encrypt the code is changed from the previous host, and as such the encrypted code looks completely different. However there's still one last piece of code that remains unchanged: the decryptor stub. The stub is the same across all generations of the virus, and thus it's possible to create a static hash of this virus.
It figures that the next generation tries to rid itself of the static decryptor stub. The polymorphic virus generates a new decryptor stub every time it infects another file. It encrypts the rest of it's code and the virus consists of three parts, the replicator, the payload and the polymorphic engine. The polymorphic engine generates the decryptor stub. Let's discuss that for a moment.
There are different mutation techniques available, ranging from very complex to relatively easy. A number of these techniques are called:
-Register swapping (regswap)
For example if you have:
Code:
mov eax, ecx
xor ecx, ecx
and eax, 2
Would change into:
Code:
mov ebx, ecx
xor ecx, ecx
and ebx, 2
-Permutation
If you have:
Code:
mov eax, 3
inc ecx
add eax, ecx
Then you get:
Code:
inc ecx
add eax, ecx
add eax, 3
-Junk insertion:
If you have:
Code:
mov eax, 3
push ecx
xchg eax, ecx
pop eax
It would become:
Code:
mov eax, 3
je I_Get_Never_Executed ;junk
push ecx
xchg eax, ecx
mov eax, 346829342 ; junk
sub eax, 2 ; junk
and eax, 3 ; junk
pop eax
I'm sure the complexity of the polymorphic engine must be apparent for anyone with even just a passing experience of the assembler language. The engine must be able to calculate offsets for jumps, reassemble instructions and trace execution paths to perform permutation (it has to know which instructions can be placed in front of another while not changing the total outcome). And it has to do all this while being small and offset independent, there are even such engines written in the assembler language!
While the polymorphic virus seemed to be a game breaker for the anti-virus companies there shone one speck of light at their horizon. You see, most virus creators where just programmers and hobbyists, they have no notion of number theory and advanced maths. Remember that the polymorphic virus still preforms encryption on the larger part of it's code? There's the problem. Decrypting code, no matter how you look at it, is always a mathematical operation performed on numbers. As such the whole mutation process can be seen as a gigantic math formula. Consider this:
A + B is the step needed to perform decryption, C is the final (executable) state of the virus, the virus is decrypted in this state:
No matter how many other operations you perform (permutation, regswap, junk ) steps A and B must always be performed to decrypt the virus.
So:
Code:
A + E + Z - E + B - Z = C
E and Z represent junk and permutation in the code
Still contains A + B in that order. Now solving the most basic versions of polymorphic code can be as simple as:
Pseudo code of an emulator in av software
Code:
while( Execute_instruction() ) {
if( Get_result() == c )
printf("Virus!!!!\n");
}
So the next generation of viruses left the good 'ol encryption habits of it's predecessors and did something astonishing. They are called metamorphic viruses. These most complex pieces of code perform mutation on
the entire virus. Every. single. generation. has a different binary footprint. Therefore they don't need to encrypt their code. The consist of three parts: the replication engine, the metamorphic engine ( mutator ) and the payload. Just like the evolution of the replication engine ( going from replication to replication + encryption in the encrypted virus) the polymorphic engine grows into the metamorphic engine. Where the polymorphic engine only generates 1 lonely function to decrypt the encrypted virus the metamorphic engine reassembles the entire virus every time it's ran. Like the polymorphic engine it uses permutation, junk insertion and regswaps and other less common techniques also found in polymorphic engines. However the best of metamorphic code gets two other, very potent, techniques: Cross architecture infection. The best engines are capable of translating their machine code into a different architectures, switching between ia32, AMD64 and ARM (and perhaps others). The second technique is cross platform infection, where a virus can infect multiple platforms (like linux, mac and windows (or android and ios) ). A virus will need a hard coded map of opcodes, instructions and system calls to perform these feats, but that's hardly inconvenient since cryptography can be used there and tables of opcodes could possibly be mutated.
The secret behind metamorphic viruses is to use a representation of the virus in a more convenient language then assembler. While there are metamorphic viruses that can work on bare machine code, they are not recent and they tend to be huge and
very complex. The idea is to write a language that's easy to mutate and easy to translate to machine code. A language with fixed instruction lengths would do very nicely. Say you you make all instructions a size of 16 bytes, if you translate the move a,b instruction into machine code. The machine code is likely to be smaller then 16 bytes. Thus you have made space to include junk code, or to place other instructions (permutation).
Metamorphic viruses (if done properly) can only be detected by anti-virus companies after thorough examination, often a couple of 100 man hours, before a suitable string of bytes or greater reappearing pattern can be seen. This process can't be automated like with the polymorphic virus because there's no final state C. It's not as simple as saying A + B = C. Since the virus does not settle back into a known state, but keeps evolving. The unlucky (team of) employees must find the code that's vital to the proper execution of a virus. Consider:
Code:
mov eax, 1
mov eax, 3
sub eax, 2
There is a near infinite number of ways to do the exact same thing, without using this code. But the fact remains that a register, at the end all this code, must contain the value of 1. This is what you'd be looking for as a anti-virus company.
This concludes the first introduction to the virus world, I'm sorry if you'd expected some code or more examples, but rest assured the next part will contain some more exciting stuff!