Sunday, April 12, 2009

Crafty Gits

I’m taking a class this semester called System Programming Concepts (UNO course number CSCI 2467). The class is ostensibly supposed to be about system programming, duh. It varies how much C programming, csh scripting, awk and sed using, grepping and generally UNIXy stuff is taught depending on the teacher and his disposition. I happened to have it with the lead Bioinformatics prof at UNO currently, Dr. Stephen Winters-Hilt. He’s a brilliant brilliant fellow, and his research in Bioinformatics has made him quite familiar with Perl. So, in this class, we’ve learned some basic UNIX commands, some C (all classic recursively solved problems) some csh scripting, but the real kicker is the amount of Perl we’ve had to learn. It turns out that Perl is super fast because of things like the built in hash primitive and the fact that most of the Perl that people write utilizes precompiled C to make things go faster. Forgetting the fact that your entire program looks like a big regular expression, its pretty neat, and the first problem he gave us on our midterm showed me that I should never trust search in text editors, not grep, or ctrl-f or /string (vi search). They’re all lousy. Here was his problem.

Given a virus genome, construct a Perl script that finds and counts all one base occurrences (a,c,t,g), 2 base occurrences(aa,ac,at,ag,ca,cc,ct,cg, etc.) all the way to 15 base occurrences (1 billion possible occurrences 1,073,741,824 or 4^15).

Once parsing the entire file into a single string and shoving it into an array (1 char per index) the next step is to start searching. His suggestion was to use the wonderful Perl hash primitive to track counts of bases seen thusly:

$my_hash{aatgcgtta}++

I don’t know about you, but I think that’s pretty damned hot. Sexy even. Incrementing the key of a hash so that its value is effectively the counter. NOT TO MENTION, if the key doesn’t exist, in Perl it creates the key, and sets the value to 1. Counters have never seemed so badass.

I’m not going to post my solution, in case someone taking the class stumbles upon this website (not likely) but I will say this: The problem is meant to screw with your brain. Like all good CS assignments, there is a catch that will frustrate you for days to teach you some kind of lesson.

Once I finished my solution to work for 1-base to 15-base cases, I decided to look in the text file itself to see if I was on the right track. I took one of the 3-base occurrences that my output claimed occurred 38 times. I started grepping around in vim and counting by hand. I counted less than 38. Panic-stricken, I thought that my algorithm for constructing strings was totally busted and I had to turn it in incomplete.

The following week I revisited my algorithm because it has to be redone in C and Python now. While looking at it I decided to look at the 15-base case. There was a single base that occurred twice in the 7500 character string, but grepping in the file showed only one occurrence. Perplexed I started at a long string of a’s, c’s, t’s, and g’s for a few hours and discovered that there WERE two occurrences of the string described by my algorithm. The strings overlapped. For example the base:

tttaattaggtttaa

occurs twice in the sequence below. The problem is that the second occurrence is contained in the first.

tttaattaggtttaattaggtttaatttgatgtt
tttaattaggtttaattaggtttaatttgatgtt

See for yourself. Grep, ctrl-f, /search_term WHATEVER, will not find a second string in a search if the second string is partly composed of the first string. Testing my program by using time honored means of text editor search only served to confuse the shit out of me. I don’t know if the lesson to be learned was that you can’t always trust tools provided to you, no matter how old, but that’s certainly what I took from that assignment.