Perl and the human genome project


More than twenty years ago Perl made its mark on one of the most important human endeavours. It all started in Cambridge, at a meeting between the computer scientists of the largest DNA sequencing center in Europe and their counterparts from the United States. The meeting was meant to solve a sharing problem: the two DNA centers used the same systems and techniques, yet they were unable to share data or compare results. But let’s put things in context first.

Backstory

This happened in 1996, six years after the human genome project got off the ground. This ambitious project was a crucial endeavour from both a scientific and a medical standpoint: a thorough understanding of our genetic makeup helps us understand many biological processes, how the nervous system works, how organisms develop and evolve, and so on. The human genome project was a massive undertaking aimed at gaining knowledge and understanding of our DNA and everything related to it.

To better understand their predicament and Perl’s role in solving it, you have to appreciate the size of this project. We're talking about human DNA, the very fabric of what we are. The project as a whole was going well: mapping was making good progress and several milestones had already been reached. The science wasn’t the problem; the problem was scale, the sheer size of the human DNA sequencing effort.

DNA can be written as a string over a four-letter alphabet, each letter representing one of the four chemical bases that form the double helix: G, A, T and C. The aim of all of this was to determine the order of those letters in the string. Simple, right? Not quite. The finished data alone needed about 3 GB of storage, which doesn't sound like much by today's standards, but back then it was quite substantial. And that is just the finished product; the experimental data needed a lot more. This is where the hiccup began.

The computing resources the project needed went far beyond what was readily available at the time. Remember, this was 20 years ago, when Windows 95 still felt like a miracle. The amount of data was, and still is, staggering: the estimate was that somewhere between 1 and 10 terabytes of information would accumulate before the project reached a conclusion.

Perl walks in

The moment they took on this vast project, they knew that computer science was going to be a big part of its success. Each data center had a core informatics team that provided computer support and database services. These teams were an integral part of the mission at hand; some had good results, others not so much. Through mixed results they carried on, and most of them built modular systems whose parts could be swapped out without retooling the whole.

To understand how they solved it, you need to understand the process. A newly sequenced piece of DNA goes through several steps:

1. The first step is checking the quality of the sequence, taking into account factors such as its length.

2. The second step is a vector check, which basically verifies that the DNA in the sequence is human.

3. The third step is comparing the sequence against other sequences in various databases, to see if there is any match that might indicate some sort of function for the new DNA sequence.

4. The last step is loading the sequence, with all the relevant information, into the laboratory's database.

All these steps form a pipeline, and with that in mind the original developers figured a Unix pipe could string the necessary tools together.

Perl gets the ball rolling

A simple Perl-based data exchange format was developed, called “boulderio”. It allowed loosely coupled programs to add information to a pipe-based I/O stream. This was just the beginning; additional Perl modules were created to handle each step in the process.

Here is an example of a DNA sequence analysis pipeline built from such scripts:

name_sequence.pl < new.dna |
quality_check.pl |
vector_check.pl |
find_repeats.pl |
search_big_database.pl |
load_lab_database.pl

The given file, containing the new DNA sequence, is processed by the "name_sequence.pl" script. This script gives the sequence a new name and puts it in boulder format, so at the end we get this:

NAME=L26P93.2

SEQUENCE=GATTTCAGAGTCCCAGATTTCCCCCAGGGGGTTTCCAGAGAGCCC......
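To give a flavour of what a stage like this might look like, here is a minimal sketch in the spirit of name_sequence.pl. It is not the original script; the naming convention and the cleanup rule are assumptions made purely for illustration.

#!/usr/bin/perl
# Sketch of a name_sequence.pl-style stage: read raw DNA from STDIN,
# attach a name and emit a TAG=VALUE record like the one shown above.
use strict;
use warnings;

# Hypothetical: take the name from the command line; in reality it would
# come from the lab's own tracking system.
my $name = shift(@ARGV) || 'L26P93.2';

my $dna = do { local $/; <STDIN> };   # slurp the raw sequence
$dna =~ s/[^GATCNgatcn]//g;           # keep only DNA letters

print "NAME=$name\n";
print "SEQUENCE=", uc($dna), "\n";

Run as the first element of the pipeline, for example: name_sequence.pl L26P93.2 < new.dna | quality_check.pl | ...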

Each step mentioned above corresponds to a Perl script. The next one is the quality checking stage, which transforms the data into this:

NAME=L26P93.2

SEQUENCE=GATTTCAGAGTCCCAGATTTCCCCCAGGGGGTTTCCAGAGAGCCC......

QUALITY_CHECK=OK
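Every later stage follows the same pass-through pattern: copy the tags you receive to standard output, then append your own. Here is a minimal sketch of a quality_check.pl-style filter, with a length-only rule invented just for this example (the real check was far more thorough):

#!/usr/bin/perl
# Sketch of a quality_check.pl-style filter: pass every TAG=VALUE line
# through unchanged, then append a QUALITY_CHECK tag of our own.
use strict;
use warnings;

my $sequence = '';
while (my $line = <STDIN>) {
    print $line;                                   # hand existing tags on, untouched
    $sequence = $1 if $line =~ /^SEQUENCE=(\S+)/;  # remember the sequence for our own check
}

# Toy rule: call the read good if it is at least 400 letters long.
print 'QUALITY_CHECK=', (length($sequence) >= 400 ? 'OK' : 'FAILED'), "\n";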

The vector stage pulls out the SEQUENCE tag, runs the vector checking algorithm and changes the data to this:

NAME=L26P93.2

SEQUENCE=GATTTCAGAGTCCCAGATTTCCCCCAGGGGGTTTCCAGAGAGCCC......

QUALITY_CHECK=OK

VECTOR_CHECK=OK

VECTOR_START=10

VECTOR_LENGTH=300
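A vector_check.pl-style stage works the same way, except that it pulls out the SEQUENCE value, looks for the vector and reports where it sits. The vector fragment below is a placeholder; a real check would compare against the full sequences of the cloning vectors used in the lab:

#!/usr/bin/perl
# Sketch of a vector_check.pl-style filter: pass tags through, look for a
# known vector fragment in the sequence and report its position and length.
use strict;
use warnings;

my $vector   = 'CCCCCAGGGGG';   # placeholder vector fragment, for illustration only
my $sequence = '';

while (my $line = <STDIN>) {
    print $line;
    $sequence = $1 if $line =~ /^SEQUENCE=(\S+)/;
}

my $pos = index($sequence, $vector);
if ($pos >= 0) {
    print "VECTOR_CHECK=OK\n";
    print 'VECTOR_START=',  $pos + 1, "\n";        # 1-based position, as in the record above
    print 'VECTOR_LENGTH=', length($vector), "\n";
} else {
    print "VECTOR_CHECK=FAILED\n";
}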

The record travels down the pipeline like this until processing is complete. Each center took its own road, though a few of them arrived at the same basic idea, something similar to boulderio. But because each group tackled the same problem independently and built its own unique solution, interchangeability became a problem: the groups couldn't share the data they produced in house, and they couldn't share solutions or software either.

The meeting mentioned at the beginning of the article was part of a session that dealt with this interchangeability problem. The centers adopted a common data exchange format called CAF, creating common ground to work on together. Thanks to Perl.

Perl’s contribution was not just solving the technical issues of processing and scale; it also helped the teams reach a common understanding of the data produced and the progress made on each side of the pond. Yes, we hear the naysayers: that was 20 years ago. But that doesn’t take anything away from the prestige and huge contribution Perl made to one of the most valuable and forward-looking projects in human history.
