COS 126 Assignment 7 Checklist

Part 0: preparation

Do the readings and exercises on strings.

Copy the following files from the gene directory into an empty directory.

You can use gene.c as a starting point. It handles the command line input. You will run your program with

a.out prot.1 gene.1

Don't accidentally reverse the order of the two command line arguments.

Some genetic jargon you should get used to.

nucleotide - one character from the set {a, c, t, g}

codon - 3 nucleotides, e.g., att, agc, or tat

amino acid - one uppercase character from A through Y

protein - sequence of amino acids, e.g., MSIQHMR

Part 1: hash()

First, get hash() working and debugged.

The input to hash() is an array of characters from the set {a, c, t, g}. Only the first 3 characters (corresponding to a codon) of the input array will be used. The output (or return value) is an integer between 0 and 63, inclusive. For example, if the first three characters of the array are aaa, return 0. If the input is aac, return 1. If the input is aag, return 2. If the input is ttg, return 62. If the input is ttt, return 63.

It will be helpful to think of the three characters (codon) as an integer represented in base 4, with the mapping a=0, c=1, g=2, and t=3. Your job is to convert this to a C integer. This is analogous to converting between the base 4 and base 10 representations of an integer.

To get started, you may want to write a small helper function char_to_int() that converts a character into an integer: a to 0, c to 1, g to 2, and t to 3.
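Here is one possible sketch of the base 4 idea, not a reference solution. The name char_to_int() comes from the text above. Returning -1 for an unexpected character and declaring the parameter as const char[] are assumptions made for illustration; gene.c may declare hash() differently.

```c
/* Map one nucleotide to its base 4 digit: a=0, c=1, g=2, t=3.
   Returning -1 for any other character is an assumption of this sketch. */
int char_to_int(char c)
{
    switch (c) {
    case 'a': return 0;
    case 'c': return 1;
    case 'g': return 2;
    case 't': return 3;
    default:  return -1;
    }
}

/* Treat the first three characters of codon as a 3-digit base 4 number.
   Assumes all three characters are valid nucleotides. */
int hash(const char codon[])
{
    return 16 * char_to_int(codon[0])
         +  4 * char_to_int(codon[1])
         +      char_to_int(codon[2]);
}
```

For example, att has the digits 0, 3, 3, so hash("att") is 0*16 + 3*4 + 3 = 15.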

One approach for debugging hash() is to first comment out the portion of code that prints out the results, and replace it with printf() statements like the following:

printf("%d %d\n", hash("att"), hash("gct"));

You should get the following output:

15 39

Part 2: unhash()

Now, get the unhash() function working and debugged.

The input is an integer between 0 and 63. The function does not return a value. Instead, it prints to the screen the 3 characters corresponding to this integer (see above).

As in the hash() function, you may want to start by writing a helper function int_to_char() that converts an integer to its corresponding character. It is the inverse of char_to_int().

There are many ways to write this function. It boils down to converting between the base 10 and base 4 representations of an integer.
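Here is one of those ways, as a sketch only. int_to_char() is the helper named above. codon_of() is a hypothetical helper introduced here so that the conversion can be checked without looking at the screen. Whether unhash() should also print a space or a newline after the codon is not settled by this sketch.

```c
#include <stdio.h>

/* Inverse of char_to_int(): 0, 1, 2, 3 map to a, c, g, t.
   Assumes 0 <= d <= 3. */
char int_to_char(int d)
{
    return "acgt"[d];
}

/* Hypothetical helper (not part of the assignment): write the codon
   numbered h, where 0 <= h <= 63, into out and null-terminate it. */
void codon_of(int h, char out[4])
{
    out[0] = int_to_char(h / 16);       /* most significant base 4 digit */
    out[1] = int_to_char((h / 4) % 4);  /* middle digit */
    out[2] = int_to_char(h % 4);        /* least significant digit */
    out[3] = '\0';
}

/* Print the three characters of the codon numbered h. */
void unhash(int h)
{
    char codon[4];

    codon_of(h, codon);
    printf("%s", codon);
}
```

For example, 39 = 2*16 + 1*4 + 3, so its digits are 2, 1, 3 and the codon is gct.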

To debug, you can replace the printf statements above with:

unhash(15);
unhash(39);

This should produce the following output:

att gct

Part 3: Input and Output

Before you can do the pattern matching part of the assignment, you will need to read in the protein and gene sequences using file input. We'll also review some of the key variables that you'll use.

geneseq[] is a string that holds the sequence of {'a', 'c', 't', 'g'} characters that are read in from the gene input file.

protseq[] is a string that holds the sequence of {'A', 'B', ..., 'Y'} characters that are read in from the protein input file.

genecode[] is a 64 character array that you will use to keep track of the matches you have made. Understanding the purpose of this array is crucial to completing the assignment. An explanation follows, but if you are unsure, get clarification from a preceptor before writing any more code.

Each of the 64 entries in genecode[] corresponds to one of the 3-character codons. You would like to be able to use genecode["att"] to access the array value corresponding to the codon att. Unfortunately, C requires that array indices be integers. This is the whole purpose of hash() and unhash(): they allow you to use codons to index the array. To access the array element corresponding to the codon att, use genecode[hash("att")]. This is the same as genecode[15], which is now valid C. Similarly, genecode[0] corresponds to the codon "aaa", and genecode[1] corresponds to the codon "aac", and so on.

Each element of genecode[] holds a single character: a capital letter corresponding to one of the 25 amino acids. Whenever you store an amino acid in genecode[], you are matching a codon with an amino acid. For example, setting genecode[15] = 'E' says that the amino acid 'E' is encoded by the codon "att". The goal of this assignment is to find a consistent matching of codons to amino acids, and produce a table like gene.3.ans.
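The indexing idea can be sketched as follows. set_encoding() and get_encoding() are hypothetical helpers used only to illustrate it, and the hash() shown is a compact stand-in for the one you write in Part 1 (it assumes valid input and treats any character other than a, c, or g as t).

```c
/* Compact stand-in for the hash() described in Part 1. */
int hash(const char codon[])
{
    int h = 0;
    int k;

    for (k = 0; k < 3; k++) {
        char c = codon[k];
        int digit = (c == 'a') ? 0 : (c == 'c') ? 1 : (c == 'g') ? 2 : 3;
        h = 4 * h + digit;
    }
    return h;
}

/* Hypothetical helper: record that codon encodes the amino acid aa. */
void set_encoding(char genecode[64], const char codon[], char aa)
{
    genecode[hash(codon)] = aa;
}

/* Hypothetical helper: return the amino acid currently matched to codon. */
char get_encoding(const char genecode[64], const char codon[])
{
    return genecode[hash(codon)];
}
```

Calling set_encoding(genecode, "att", 'E') has the same effect as genecode[15] = 'E'.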

Your first task is to read the protein file into the protseq[] array. Be sure that all values in the array are uppercase characters 'A' through 'Y'. After the last amino acid character is read in, append the null character '\0' to the array to denote the end of the protein string. Print out the resulting string to standard output using printf("%s\n", protseq) to make sure you read it in successfully. Hint: see the last exercise question on strings.
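A minimal sketch of this step follows. read_protein() is a hypothetical helper, and the choice to silently skip any character outside 'A' through 'Y' (such as a newline) is an assumption about the input files, not something the assignment specifies.

```c
#include <stdio.h>

/* Hypothetical helper: read amino acids from fp into protseq, which has
   room for max characters including the terminating '\0' (max >= 1).
   Keeps only 'A' through 'Y' and skips everything else.
   Returns the number of amino acids stored. */
int read_protein(FILE *fp, char protseq[], int max)
{
    int c;
    int n = 0;

    while (n < max - 1 && (c = fgetc(fp)) != EOF) {
        if (c >= 'A' && c <= 'Y')
            protseq[n++] = (char) c;
    }
    protseq[n] = '\0';    /* mark the end of the string */
    return n;
}
```

The same pattern, with the test changed to accept only a, c, g, and t, could be used for the gene file.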

Now, write code to read in the genetic sequence into the geneseq[] array. Print it out to make sure you read it in properly.

The last part of main() prints out the table of amino acid encodings. This is the only place the code uses unhash(). It prints out each value of the genecode[] array, along with the corresponding codon. If you like, you may wish to modify it so that it prints out the table in 4 columns instead of 1.
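If you do try the 4 column layout, one way it might look is sketched below. print_table() is a hypothetical helper, and the spacing is an arbitrary choice rather than the format of gene.3.ans. The codon text is rebuilt inline here to keep the example self-contained; in gene.c that is the job of unhash().

```c
#include <stdio.h>

/* Hypothetical helper: print the 64 entries of genecode to out,
   four codon/amino-acid pairs per line. */
void print_table(const char genecode[64], FILE *out)
{
    int i;

    for (i = 0; i < 64; i++) {
        fprintf(out, "%c%c%c %c",
                "acgt"[i / 16], "acgt"[(i / 4) % 4], "acgt"[i % 4],
                genecode[i]);
        /* End the line after every fourth entry, otherwise add a gap. */
        fputs(i % 4 == 3 ? "\n" : "   ", out);
    }
}
```

Calling print_table(genecode, stdout) writes 16 lines of 4 entries each.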

Part 4a: pattern matching

This is the trickiest part of the assignment. You should carefully figure out a plan of attack before writing code. In this part, we describe how to check whether a match occurs at one particular offset. In the next part, you will add an extra loop that checks for matches at all possible offsets. Here's a sketch of what you need to do.

Initialize each element of the genecode[] array to '-' .

You will probably want two integer variables, say i and j, to hold the current indices into the geneseq[] and protseq[] arrays. Initially i will be set to the offset, and j to 0.

To test for a possible alignment at the given offset, repeat the following until you run out of amino acids in your protein sequence. (Consider writing a loop that counts from j = 0 to the length of the protein sequence.)

The current codon consists of geneseq[i], geneseq[i+1], and geneseq[i+2].

Look up the current codon in the genecode[] table.

If the entry in the genecode[] table is blank ('-'), store the current amino acid protseq[j] there.

Otherwise, if the amino acid stored there does not match the current amino acid protseq[j], exit the loop, perhaps using a break statement.

Increment i by 3 and j by 1.

The tricky part is looking up the codon in the genecode[] table. This involves calling hash() with a pointer to geneseq[i]. This is the only place in the matching phase where you'll need to use a pointer. Recall that geneseq + 17 is one way to denote a pointer to element 17 of the geneseq[] array; it is the same as &geneseq[17]. Use prot.1 and gene.1 to test your code.

When you exit the loop outlined above, you need to know whether it was because you reached the end of the protein sequence (a match) or because a conflict occurred. If you found a conflict, print out the position in the protein sequence where it occurred.
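The steps for a single offset can be sketched as below. This is one possible arrangement, not a reference solution. match_at() and its return convention (-1 for a match, otherwise the conflicting position in the protein) are hypothetical, and so is the decision to report a conflict when the gene sequence runs out before the protein does. The hash() shown is a compact stand-in for your own.

```c
#include <string.h>

/* Compact stand-in for the hash() described in Part 1. */
int hash(const char codon[])
{
    int h = 0;
    int k;

    for (k = 0; k < 3; k++) {
        char c = codon[k];
        int digit = (c == 'a') ? 0 : (c == 'c') ? 1 : (c == 'g') ? 2 : 3;
        h = 4 * h + digit;
    }
    return h;
}

/* Hypothetical helper: try to align protseq with geneseq starting at
   offset. genecode must already be filled with '-'.
   Returns -1 on a match, otherwise the index j in protseq where the
   alignment failed. */
int match_at(const char geneseq[], const char protseq[], int offset,
             char genecode[64])
{
    int glen = (int) strlen(geneseq);
    int plen = (int) strlen(protseq);
    int i = offset;
    int j;

    for (j = 0; j < plen; j++, i += 3) {
        int h;

        if (i + 3 > glen)
            return j;                  /* not enough nucleotides left */
        h = hash(geneseq + i);         /* pointer to geneseq[i] */
        if (genecode[h] == '-')
            genecode[h] = protseq[j];  /* first use of this codon */
        else if (genecode[h] != protseq[j])
            return j;                  /* codon already encodes another amino acid */
    }
    return -1;                         /* reached the end of the protein */
}
```

Returning the conflict position from the function makes it easy for the caller to tell the two ways of leaving the loop apart.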

If you initialize the variables that keep track of the current position in the geneseq[] and protseq[] arrays to zero, then you should get the following debugging output.

To test your code, try initializing the variable which indexes the current position in the geneseq[] array to 2, 3, and 10. If the initial value is 10, you should find a match and get the following match output.

Here are some debugging hints.

You may wish to use the strlen() library function.

Lots of people accidentally use = instead of ==, so consider yourself warned.

Part 4b: pattern matching

At the end of the last step, you changed the offset into the geneseq[] array by editing your code. If you happened to choose the right offset (10), you found the match. Modify your code so that it checks all possible offsets.

You will need to create an outer loop to repeat the pattern matching code you wrote in the previous step. Determine the conditions under which you will want to execute the loop, so that your program won't crash if no match is found.

Be sure to print out the position where the match occurred.

Don't forget to reinitialize the genecode[] array to '-' each time through the loop!
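A sketch of the outer loop, under the same caveats as before: find_match() is a hypothetical helper, its return convention (the matching offset, or -1 if there is none) is an assumption, and hash() is a compact stand-in for your own. The loop condition stops at the last offset that still leaves room for a whole protein, so the code never reads past the end of geneseq[].

```c
#include <string.h>

/* Compact stand-in for the hash() described in Part 1. */
int hash(const char codon[])
{
    int h = 0;
    int k;

    for (k = 0; k < 3; k++) {
        char c = codon[k];
        int digit = (c == 'a') ? 0 : (c == 'c') ? 1 : (c == 'g') ? 2 : 3;
        h = 4 * h + digit;
    }
    return h;
}

/* Hypothetical helper: search every offset of geneseq for an alignment
   with protseq. Returns the first matching offset, or -1 if none exists.
   On success genecode holds the codon table for that offset. */
int find_match(const char geneseq[], const char protseq[], char genecode[64])
{
    int glen = (int) strlen(geneseq);
    int plen = (int) strlen(protseq);
    int offset;

    /* A usable offset must leave room for 3 * plen nucleotides. */
    for (offset = 0; offset + 3 * plen <= glen; offset++) {
        int j;

        memset(genecode, '-', 64);     /* start each attempt with a blank table */
        for (j = 0; j < plen; j++) {
            int h = hash(geneseq + offset + 3 * j);

            if (genecode[h] == '-')
                genecode[h] = protseq[j];
            else if (genecode[h] != protseq[j])
                break;                 /* conflict: try the next offset */
        }
        if (j == plen)
            return offset;             /* every amino acid was placed */
    }
    return -1;
}
```

With geneseq "aaaaaaaac" and protseq "AB", offsets 0 through 2 each pair the codon aaa with both A and B, so the first match is at offset 3.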

Written by Lisa Worthington and Kevin Wayne