CS 2550 - Foundations of Cybersecurity

Project 4: Anti-virus

This project is due at 11:59pm on Tuesday April 17, 2018.

Description and Deliverables

In this project, you will gain hands-on experience with a core technique in defensive cybersecurity: signature matching. You will develop a simple anti-virus that (1) creates signatures that match known malware, and then (2) examines unknown binaries to determine if they contain a malware signature. You will be provided with malware and benign binaries to help train your anti-virus.

To receive full credit for this project, you will turn in (at least) three things:

  1. A program named av-train that analyzes some given binaries and produces signatures of malware
  2. A program named av-detect that analyzes some given binaries and determines if each one matches a malware signature or not
  3. A Makefile that compiles your two programs (or is empty and does nothing, if you're using a language that doesn't require compilation)
The exact formats of these deliverables are described in detail below.

WARNING: Live Malware!

As part of this assignment you will download and decompress an archive that is full of live Linux malware. I repeat: THESE ARE LIVE MALICIOUS BINARIES. Under no circumstances should you execute these binaries, on any system. When you decompress the archive, your actual anti-virus program may flip out (because: malware!); you should tell your anti-virus to ignore the files, since you're smart and know not to run them.

Project 4 Server

Since this project involves live malware, you are welcome to develop your anti-virus on our server. Our server is available via SSH at:
This will save you the trouble of having to download the malware onto your own machine. That said, you are welcome to develop on your own machine if you wish, although we suggest that you place the malware inside a virtual machine just to be on the safe side.

DO NOT DOWNLOAD MALWARE ONTO THE CCIS MACHINES. You should only work on this assignment on our server, or your own machine. Systems will not be happy if a bunch of malware shows up on their machines.

About Signatures and Anti-virus

A malware signature is a string of length n that uniquely identifies a specific piece of malware (or family of malware, in the case of variants). The signature is typically a string of bytes extracted directly from the malware binary. The key is to identify a string of bytes that (1) exists in the malware but (2) does not exist in any benign binaries (i.e. all the normal, non-malware binaries on your system).

The job of an anti-virus program is to identify malware using a set of signatures. It is critical that the signature strings actually exist in the malware binaries so that the malware can be detected (true positives). If a malware binary is not matched by any signature, then the anti-virus will not be able to detect it (false negative). At the same time, it would be very bad if the signatures accidentally matched good, benign binaries (false positives), since this would result in the anti-virus quarantining or deleting binaries that the user actually needs and wants. Ideally, an anti-virus should classify every benign binary as safe (true negatives).
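
The four outcomes described above can be tallied mechanically once the true label and the anti-virus verdict are known for each binary. The following is a minimal Python sketch; the function name tally_outcomes and the (actual, flagged) pair representation are illustrative assumptions, not part of the assignment.

```python
from collections import Counter


def tally_outcomes(results):
    """Count outcomes from (is_actually_malware, flagged_as_malware) pairs.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for actual, flagged in results:
        if actual and flagged:
            counts["true_positive"] += 1    # malware, detected
        elif actual and not flagged:
            counts["false_negative"] += 1   # malware, missed
        elif not actual and flagged:
            counts["false_positive"] += 1   # benign, wrongly flagged
        else:
            counts["true_negative"] += 1    # benign, left alone
    return counts
```

A perfect anti-virus would report zero false negatives and zero false positives on such a tally.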

Goals and Datasets

In this assignment, your goal is to develop a complete anti-virus system that maximizes true positives (malware detections) and true negatives (not detecting benign binaries), while also minimizing false negatives (malware that is missed) and false positives (benign binaries that are mistaken for malware). You will develop two programs: av-train and av-detect, the former of which creates signatures from known binaries, and the latter of which uses the signatures to classify unknown binaries.

To achieve these goals, we have produced four datasets:

In other words, you will use the two public datasets to develop, debug, and test your anti-virus system. In turn, we will evaluate and grade your system based on the two private datasets.

The safe_pub and malware_pub datasets are already available on our server. You can find them at /tmp/malware_pub/ and /tmp/safe_pub/.


av-train

The first program you will develop is av-train. This program takes three parameters as input: (1) a directory containing malware binaries, (2) a directory containing benign binaries, and (3) the name of a file that will contain the set of malware signatures that you derive from the given directory of malware. Obviously, your goal is to produce signatures that maximize true positives and true negatives, while minimizing false positives and false negatives.

Your av-train program must support the following command line syntax:

$ ./av-train <malware directory> <benign directory> <output signature file>
The first two parameters are paths to directories containing one or more binaries (malware binaries in the first case, and benign binaries in the second case). During your development, the former path will point to the decompressed malware_pub.tar.gz, and the latter path will point to the decompressed safe_pub.tar.gz. The third parameter is the file where your av-train program should store its output (i.e. the malware signatures). You may store your output in whatever format you wish; just keep in mind that your av-detect program will need to read this file as input, so you should do yourself a favor and use an output format that is easy to parse. Your signature file does not need to be in plaintext or human-readable; it may be a binary file format.
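
Since the signature file format is left to you, one easy-to-parse option is a text file with one hex-encoded signature per line. The sketch below assumes this layout; the function names save_signatures and load_signatures are hypothetical, and any other format that av-detect can read back would serve equally well.

```python
def save_signatures(path, signatures):
    """Write each signature (a bytes object) as one line of hex text.

    Assumed format for illustration; the assignment does not mandate it.
    """
    with open(path, "w", encoding="ascii") as f:
        for sig in signatures:
            f.write(sig.hex() + "\n")


def load_signatures(path):
    """Read the file back into a list of bytes objects, skipping blank lines."""
    with open(path, "r", encoding="ascii") as f:
        return [bytes.fromhex(line.strip()) for line in f if line.strip()]
```

Hex text is simple to inspect and debug, but it uses two characters per signature byte; a binary format would produce a smaller file.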

As an example, an invocation of your av-train program might look like this:

$ ./av-train malware_pub/ safe_pub/ signatures.av
Your av-train program may print output to STDOUT or STDERR if you wish; we will not consider this output when grading your program.


av-detect

The second program you will develop is av-detect. This program takes at least one, and possibly more, command line parameters:
$ ./av-detect <input signature file> [unknown binary 1] [unknown binary 2] ... [unknown binary n]
The first parameter is the signature file produced by your av-train program. All of the other parameters are unknown binaries: for each given unknown binary, your av-detect program should print to STDOUT (1) the name of the file and (2) whether it is "MALWARE" or "SAFE". Note that the first parameter (the signature file) is required; the list of unknown binaries is not required, and can be of any length.

Example invocations of your av-detect program might look like the following:

$ ./av-detect signatures.av
$ ./av-detect signatures.av weird_binary
weird_binary: MALWARE
$ ./av-detect signatures.av thing1 thing2
thing1: SAFE
thing2: MALWARE
$ ./av-detect signatures.av malware_pub/*
00381f84c8ca598b1fe3a69dca816586: MALWARE
00744ba3546a01e8c2a3cb3711c3ca85: MALWARE
0086eced29d57421ec8778f1f3084915: MALWARE
0093fdcb12b6fb836495b7cd53d19ddb: MALWARE
$ ./av-detect signatures.av safe_pub/*
0026c7cf39fd8fdd895b64094568ed9a: SAFE
00280fcd483258c492ea93acd2625a86: SAFE
0036df2fcd3fe2616c7a83dc07346598: SAFE
004569d222ee81d3e0b39aa1864fb1a4: SAFE
Note that the output format of your av-detect program must match this example output precisely. Do not output any additional information to STDOUT. Make sure to print the file name, followed by a colon and a space, and then "MALWARE" or "SAFE". You may print additional output to STDERR; we will not consider this output when grading your program.
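
As a rough illustration of the required output line, the Python sketch below builds the string for one file. The helper name format_verdict is hypothetical. Printing only the base name of the path is an inference from the malware_pub/* example above, where the directory prefix does not appear in the output.

```python
import os


def format_verdict(path, is_malware):
    """Build one output line: file name, colon, space, then the verdict.

    Hypothetical helper; uses only the file name, matching the example
    output shown for malware_pub/* and safe_pub/*.
    """
    name = os.path.basename(path)
    verdict = "MALWARE" if is_malware else "SAFE"
    return f"{name}: {verdict}"
```

Each returned line would be printed to STDOUT on its own, with any diagnostics sent to STDERR instead.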

How to Think About This Assignment

At a high level, the av-detect program is simple: load the given signatures, each of which is a string, then check each binary given on the command line to see if it contains any of the signature strings. If there is a hit, then output "MALWARE"; otherwise, output "SAFE". This is a fairly straightforward exercise in string matching.
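
The matching step just described might look like the following Python sketch, which assumes the binary's contents and the signatures are already loaded as bytes objects. The function name classify is an assumption made for illustration.

```python
def classify(data, signatures):
    """Return "MALWARE" if any signature occurs in data, else "SAFE".

    data: the full contents of one binary, as bytes.
    signatures: an iterable of bytes objects.
    """
    for sig in signatures:
        # Skip empty signatures: b"" is a substring of everything,
        # so it would flag every file as malware.
        if sig and sig in data:
            return "MALWARE"
    return "SAFE"
```

This naive loop scans the binary once per signature; it is a simple starting point rather than an efficient multi-pattern matcher.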

The challenge is the av-train program. This program needs to create the signatures: for every given malware binary, it needs to find a string that does exist in the malware, but does not exist in any of the given benign programs. The crux of the assignment is finding concise (i.e. relatively short) signatures that produce true positives for all the given malware, true negatives for all the given benign binaries, and no false positives or false negatives.
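
One naive way to search for such a string is to collect every length-n substring seen in the benign binaries and then return the first length-n substring of the malware that is not among them. The Python sketch below illustrates that idea only; the function names and the fixed length n are assumptions, and it makes no attempt to keep the signature set small or to handle unseen variants.

```python
def ngrams(data, n):
    """Yield every length-n byte substring of data, in order."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def find_signature(malware, benign_blobs, n):
    """Return the first length-n substring of malware that appears in
    none of the benign blobs, or None if every substring is shared.

    Hypothetical helper: all arguments are in-memory bytes objects.
    """
    benign_seen = set()
    for blob in benign_blobs:
        benign_seen.update(ngrams(blob, n))
    for candidate in ngrams(malware, n):
        if candidate not in benign_seen:
            return candidate
    return None
```

For example, find_signature(b"hello evil", [b"hello world"], 4) returns b"lo e", the first 4-byte window of the "malware" that the "benign" data lacks. Holding every benign substring in a set can use a lot of memory on large datasets.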

Now you may be thinking: "This assignment is easy! My signatures will be cryptographic hashes (e.g. sha1 or sha256) of the entire malware binaries." For example, let's say you are given a malware binary that contains "ABC123" (this is nonsensical, but let's run with it). The sha256 signature of the whole file would be "e0bebd22819993425814866b62701e2919ea26f1370499c1037b53b9d49c2c8a". This seems like a good signature because it's highly likely that this sha256 hash uniquely identifies the malware binary. But there is a huge problem: if the malware changes by even a single byte, your signature will no longer be able to detect it. For example, a crafty malware author might produce a second binary containing "ABC124". The sha256 hash of this second malware sample is "cf54666dad368b8b4e183940cb4d2569aa3f84d66440318325b18952fed2edd9". In other words, signatures that encode the entire contents of a malware binary are too specific; this makes them brittle in the face of obfuscation.
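
The brittleness of full-file hashes can be checked with Python's hashlib. The sketch below compares digests rather than printing them, since the exact values depend on the exact bytes hashed (for instance, whether a trailing newline is present). The substring b"ABC12" is an illustrative choice, not a recommended signature.

```python
import hashlib

original = b"ABC123"
variant = b"ABC124"  # differs from the original in the final byte only

digest_original = hashlib.sha256(original).hexdigest()
digest_variant = hashlib.sha256(variant).hexdigest()

# A full-file hash signature derived from the original misses the variant.
hash_matches_variant = (digest_original == digest_variant)

# A shorter substring shared by both samples still matches the variant.
substring_signature = b"ABC12"
substring_matches_variant = substring_signature in variant
```

Here hash_matches_variant is False while substring_matches_variant is True, which is the gap between a signature that is too specific and one that tolerates small changes.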

In the private datasets that we will use to grade your programs, there will be malware binaries like this; i.e. they belong to the same family, so most of their bytes are the same, but not all. If you use full-file signatures (using cryptographic hashes or some other encoding scheme), your av-detect program will suffer from false negatives and you will lose points.

So what about the other end of the spectrum: let's say that you choose to use signatures of length one, i.e. a single byte. In this regime, your av-train program will attempt to find a single byte in each malware file that does not exist in any benign programs. Given that a byte can only take one of 256 values, does it seem likely that this is feasible? Absolutely not. Single-byte signatures are not specific enough, and thus your av-detect program will suffer from false positives and you will lose points.
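
To see why single-byte signatures fail, one can count which byte values never occur in the benign data; only those values could serve as one-byte signatures. The Python sketch below does this for in-memory bytes objects. The function name unused_byte_values is an assumption for illustration.

```python
def unused_byte_values(benign_blobs):
    """Return the set of byte values (0-255) that appear in none of
    the given benign blobs. Hypothetical helper for illustration.
    """
    seen = set()
    for blob in benign_blobs:
        seen.update(blob)  # iterating over bytes yields ints in range(256)
    return set(range(256)) - seen
```

Once the benign data contains every possible byte value, the returned set is empty and no single-byte signature can avoid false positives.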

The trick is for your av-train program to produce signatures of length n, where n is long enough to be specific to malware and not produce false positives, but not so long that the signature becomes brittle and suffers from false negatives when it encounters malware that is slightly different from the training binaries.

Submitting Your Project

Before turning in the project, you must register yourself for our grading system using the following command:
$ /course/cs2550/bin/register-student [NUID]
NUID is your Northeastern ID number, including any leading zeroes. This command is available on all of the CCIS lab machines.

The exact files that you submit for this assignment will vary depending on the programming language you choose to use. At a minimum, you will probably submit:

You submit your project by running the turn-in script as follows:
$ /course/cs2550/bin/turnin project4 <project directory>
where <project directory> is the name of the directory with your submission. The script will print out every file that you are submitting, so make sure that it prints out all of the files you wish to submit! The turn-in script will not accept submissions that are missing a Makefile. You may submit as many times as you wish; only the last submission will be graded, and the time of the last submission will determine whether your assignment is late.

Note that you do not need to turn in the signatures that you create for the public datasets.

At any time, you can run the following command to see all of your current grades for projects, essays, quizzes, and tests.

$ /course/cs2550/bin/gradesheet


This project is worth 12% of your final grade, broken down as follows (out of 100):

Points can be lost for turning in files in incorrect formats (e.g. not ASCII with UNIX line breaks), or failing to follow specified formatting and naming conventions.

We will use your av-train program to create new signatures based on the private datasets, and then we will evaluate your av-detect program on the private datasets using these new signatures. The (rough) sequence of commands that we will use to evaluate your programs is:

$ make
$ ./av-train malware_priv_train/ safe_priv/ signatures
$ ./av-detect signatures malware_priv_train/* malware_priv_test/* safe_priv/*
Notice the split between training and testing malware: the testing set contains malware that is similar, but not identical, to the malware in the training set. Your av-train program will not be given access to the testing malware. The signatures you calculate for the training set must be good enough to detect malware in both the training and testing sets.

Bonus Points

This assignment contains a competitive element for bonus points. Specifically, the top k students whose av-train programs produce the smallest signature files will receive bonus points. k will be determined once all submissions are graded. The most concise signature files will receive 2% bonuses (on top of the 12% project grade); runners-up will receive 1% bonuses.

Forbidden Techniques

Clever students will quickly realize that there are creative ways to complete this assignment that technically work, in the sense that they will correctly detect malware, but that violate the spirit of the assignment in various ways. As such, the following techniques are forbidden:

We reserve the right to modify the list of forbidden techniques over time, if we observe program behavior that clearly violates the spirit of the assignment. Use of online services to detect malware is considered cheating, and the penalties will be commensurate. Attempting to minimize the size of your signature file via forbidden techniques is not considered cheating, but it will disqualify you from receiving extra credit.