Assignment Goals
Solve a pattern matching problem that arises in computational biology.
Learn about strings in C.
Learn about hashing.
Checking Your Work and Hints
You might find the following demo helpful if you don't understand the statement of the problem.
Some hints on getting started are available here.
File format. For the genetic sequence, you should ignore all characters other than 'a', 'c', 'g', and 't'. For the protein sequence, you should ignore all characters other than 'A' through 'Z'. Some of the input files contain other characters, including newlines, spaces, and numbers, so be sure to ignore these when reading in the two sequences.
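The filtering rule above can be sketched as follows. This is only one possible approach, not the required one: the names is_base(), is_amino(), and read_sequence() are hypothetical helpers made up for this illustration, and the fixed-size buffer is an assumption about how you choose to store the sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: nonzero if c is one of the four genetic letters. */
int is_base(int c) {
    return c == 'a' || c == 'c' || c == 'g' || c == 't';
}

/* Hypothetical helper: nonzero if c is 'A' through 'Z'.
   Assumes a character set where the uppercase letters are contiguous (e.g. ASCII). */
int is_amino(int c) {
    return c >= 'A' && c <= 'Z';
}

/* Reads fp until EOF, storing only the characters accepted by keep().
   Stores at most max - 1 characters and always '\0'-terminates buf.
   Returns the number of characters stored. */
size_t read_sequence(FILE *fp, char *buf, size_t max, int (*keep)(int)) {
    size_t n = 0;
    int c;   /* int, not char, so that EOF can be detected */

    assert(fp != NULL && buf != NULL && max > 0);
    while ((c = getc(fp)) != EOF) {
        if (keep(c) && n + 1 < max)
            buf[n++] = (char) c;
    }
    buf[n] = '\0';
    return n;
}
```

With this sketch, a genetic file would be read with read_sequence(fp, gene, sizeof gene, is_base) and a protein file with read_sequence(fp, prot, sizeof prot, is_amino); newlines, spaces, and numbers are skipped in both cases.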
Various test protein and genetic input files are located at /u/cs126/files/gene/. The solution for the example data in prot.1 and gene.1 is gene.1.ans; the solution for prot.3 and gene.3 is gene.3.ans; the solution for prot.3 and gene.2 is "NOT FOUND". Your program should behave properly even if there is no match.
You may use the executable gene126 to test your solutions.
Submission and readme
Submit the following files: readme.txt and gene.c.
The readme.txt file should contain the following information. Here is a template readme file.
Name, precept number, high-level description of code, any problems encountered, and whatever help (if any) you received.
Describe how you implemented hash() and unhash().
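As an illustration of the kind of thing that description might cover, here is a minimal sketch of one possible hash()/unhash() pair. It rests on assumptions that this page does not state: that hash() maps a three-letter codon over 'a', 'c', 'g', 't' to an integer from 0 to 63 by reading it as a base-4 number, and that unhash() is its inverse. The signatures, the digit ordering, and the helper base_to_digit() are all hypothetical; your assignment statement may call for something different.

```c
#include <assert.h>

#define CODON_LEN 3   /* assumed: a codon is three letters */

/* Hypothetical helper: maps a base to a digit 0..3 (assumed order a, c, g, t),
   or -1 for any other character, including '\0'. */
int base_to_digit(char b) {
    switch (b) {
    case 'a': return 0;
    case 'c': return 1;
    case 'g': return 2;
    case 't': return 3;
    default:  return -1;
    }
}

/* Reads the first CODON_LEN characters of codon as a base-4 number.
   Returns a value in 0..63, or -1 if any character is not a base. */
int hash(const char *codon) {
    int h = 0;
    for (int i = 0; i < CODON_LEN; i++) {
        int d = base_to_digit(codon[i]);
        if (d < 0)
            return -1;
        h = 4 * h + d;
    }
    return h;
}

/* Inverse of hash(): writes the codon for h into out, which must have
   room for CODON_LEN + 1 characters (the letters plus '\0'). */
void unhash(int h, char *out) {
    static const char bases[] = "acgt";
    assert(h >= 0 && h < 64);
    for (int i = CODON_LEN - 1; i >= 0; i--) {
        out[i] = bases[h % 4];   /* peel off the lowest base-4 digit */
        h /= 4;
    }
    out[CODON_LEN] = '\0';
}
```

Under these assumptions every codon gets a distinct value, so the 64 possible codons can index a small lookup table directly, and unhash(hash(s), out) reproduces the first three letters of s.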
Enrichment Links
The genetic data is actually cDNA (the coding region of DNA), not DNA. The mapping is similar to the one for RNA, with t replaced by u, if you wish to compare it with your biology textbook or with the following amino acid table borrowed from EBB 320.
The genetic data is taken from the National Center for Biotechnology Information.
Kevin Wayne