Next-Generation Sequencing and its Applications in Biomedical Research
The so-called “next-generation” sequencing (NGS) technologies allow us to sequence massive amounts of DNA in a short time and in parallel, overcoming the limitations of the original Sanger sequencing methods used to sequence the first human genome. In this talk, Francesc Lopez, Director of Bioinformatics at Yale, provides an overview of NGS applications with specific examples from Mendelian genomics and cancer research.
- Next Generation Sequencing (NGS)
- Sanger’s First Sequencing Method
- Growth of Cutting Edge Technology
- The Exome-Sequencing Process
- Discovering Mutations
- A New Discovery: De Novo Mutations
Francesc Lopez, Director of Bioinformatics, Yale
Next Generation Sequencing (NGS)
Hi, everyone. Thank you very much for the kind introduction, and thank you to H2O for inviting me here. It's been quite an interesting learning process these two days. I'm going to talk about next-generation sequencing. This is the name we give to a few technologies that we use to sequence DNA, to sequence genomes. I hope that sounds exciting enough for you to keep listening. And I'm going to talk about a couple of applications that we use in medical research. Let me start by giving you a brief history of DNA sequencing. In 1953, Watson and Crick discovered the structure of DNA, and they won the Nobel Prize for that. You are probably familiar with the iconic image of DNA, the double-helix structure. Twenty years later, we got the first sequence: a sequence as long as 24 bases.
Sanger’s First Sequencing Method
So, 24 letters; 24 A, C, T, and G combinations. Later on, Sanger and collaborators created the first sequencing method, one that we still use, although not with exactly the same technique: in the past they used radioactivity to label the bases, but currently we use something safer, like fluorescence. A few years later, in '82, we got the first database where we store DNA sequences, GenBank, funded by NIH. Later on, we got the first automated machines that allow us to sequence automatically. We were able to sequence 600 bases, so the reads were getting longer, and even parallelize 96 samples. And in 2000, using Sanger sequencing, we published the first human genome. A few years later, not many, we saw the explosion of next-generation sequencing. This was a game changer. From that point to now, different technologies and different methods have appeared, but this was a big change in the sense that the sequences are not even longer.
With some of these technologies we get just a hundred bases, but they allow us to do massively parallel sequencing. Just to give you a comparison between Sanger and next-generation sequencing techniques: the human genome is made up of 3 billion bases, 3 billion A, C, T, and G combinations. For the Human Genome Project, it took over a decade and around 70 million to sequence the human genome. If you look at the machine below, this is the current technology we use in production in the Genome Center. This is a single machine that, in three days, is able to produce 1.7 terabases. So, you can pass through the human genome hundreds of times. If you go to the Genome Center and request a genome, they're probably going to charge you around $2,000 for a full, high-quality genome. Now that I've mentioned the Genome Center: there are 25 people running it. We are running several machines. We also have bioinformaticians who do the analysis of the data. And for that analysis we need our own dedicated cluster, with 3.5 petabytes of storage and 4,500 cores. This is the new one. We retired the old one, and we actually ran out of space to store all the data we had been producing on the old one.
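The "hundreds of times" figure follows from simple division. The sketch below reproduces that arithmetic using the numbers quoted in the talk; reading the run output as 1.7 terabases is an assumption, and the function and constant names are made up for illustration.

```python
# Rough depth-of-coverage arithmetic from the figures quoted in the talk.
GENOME_SIZE_BASES = 3_000_000_000   # about 3 billion bases in the human genome
RUN_OUTPUT_BASES = 1.7e12           # assumption: 1.7 terabases per three-day run


def fold_coverage(bases_sequenced: float, genome_size: float) -> float:
    """Average number of times each genome position is read."""
    return bases_sequenced / genome_size


coverage = fold_coverage(RUN_OUTPUT_BASES, GENOME_SIZE_BASES)
print(f"One run passes over the genome about {coverage:.0f} times")
```

In practice that output is shared across many samples in a run rather than spent on a single genome.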
In the center, we try to keep up with the technology and be at the cutting edge. On the left side, we have the Illumina machines that I showed you before. We have seven of them, and we use them for production. And on the right side, we have the next of the next generation. So, here is where the names start to get silly, or the third generation; I don't even know what to call it. The Illumina machines are ones that you can have on top of a desk or on top of a bench. The PacBio, for example, weighs a ton. When I saw that machine, I thought we were going backwards. I thought, "How is that? We are getting smaller in size, but we get this as a new technology?" Until I saw this:
Growth of Cutting Edge Technology
So, in 2012 Oxford Nanopore introduced this sequencing machine the size of a USB drive. It is a little bit bigger than a USB drive, it can be plugged directly into a computer, and you can do sequencing. You can sequence a small genome there. They claim that you can put a few of them in a grid, and in 15 minutes you would get a human genome sequence. We don't use it for production. It is still not ready after four years, but it clearly tells us where the future is going. So, we are probably going to be able to take our laptop, go to the field, put a drop of blood in one of those sequencers, and get a human sequence in 15 minutes, as they say. So, this is a graph of our production in recent years. We are currently producing a hundred thousand gigabases per year, more or less. But don't imagine that we only sequence genomes. There are different technologies, different things we can sequence, and only 5% of our production is dedicated to genomes.
The Exome-Sequencing Process
The production basically is done for exome sequencing. So, let me introduce what the exome is. Only 1.5% of our genome codes for proteins; so, 1.5% is genes. The rest of the genome we sometimes call junk DNA. And we basically care, or mostly care, about the genes, because they're going to harbor most of the mutations that cause disease. So, at the Genome Center, we developed with companies a technique to sequence only that 1.5%, and that's what we call the whole exome. As you can imagine, it's going to be cheaper than sequencing a genome. You can go to the sequencing center and ask, "Can I get my genes sequenced?" And currently, for $300, you're going to get your genes sequenced. This is a schematic diagram of how the process goes. Basically, we extract the DNA, we fragment the DNA, we use a commercial kit that selects that 1.5% of your genome, and then we put it in a machine. The machine gives us the sequence, and it goes through an analysis pipeline, in the hope that we find a mutation responsible for the disease we're looking at.
So, this is the output of the machine. A simple way to represent it is what we call the FASTQ format. The machine gives us an ID for every sequence, then the string of A's, C's, T's, and G's, which is the sequence itself. We currently work with reads of a hundred bases on the Illumina machines. And just below is a way to encode what we call the quality score: how confident we are that the base we are calling is the correct one. It's just encoded in ASCII. And of course we don't get only a single sequence; this machine spits out something like five billion of those sequences every time we do a run. So, you can imagine how much data we get out of it.
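To make the record layout concrete, here is a minimal sketch of reading one such record and decoding its quality string. It assumes the common Phred+33 convention (quality = ASCII code minus 33), which the talk does not specify; the record itself and the names `FastqRecord`, `parse_fastq`, and `error_probability` are invented for illustration.

```python
from typing import Iterator, List, NamedTuple

PHRED_OFFSET = 33  # assumption: Phred+33 encoding; the talk does not say which offset is used


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield one record per four-line entry: @id, sequence, '+', quality string."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.strip() for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Each quality character encodes one integer score for the base above it.
        yield FastqRecord(header[1:], seq, [ord(c) - PHRED_OFFSET for c in qual])


def error_probability(quality: int) -> float:
    """Phred scale: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


# Toy record, invented for illustration (real reads are around 100 bases long).
example = [
    "@read_1",
    "ACGTTGCA",
    "+",
    "IIIIII#5",
]

for record in parse_fastq(example):
    print(record.read_id, record.sequence, record.qualities)
```

Under this encoding a score of 20 means a 1 in 100 chance that the base call is wrong, and a score of 40 means 1 in 10,000.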
So, what do we do with it? There are different applications, but for most of them, or a big majority of them, the first step is taking those sequences and mapping them back to the human genome, for example. Here is a graphical representation of that. On the top row, we have the reference human genome, in this case chromosome 16 at a specific position. We encode the bases in different colors so it's easy to understand: in red we have the A's, in yellow the T's, in green the G's, and in blue the C's. Each line below is one of those sequences that we align to the reference. As you can see, the columns are perfectly aligned, with one exception. If you look in the middle, you will see that there's one column that has reds and blues.
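The picture described above, reads stacked under a reference with one mixed column, can be mimicked with a toy pileup. This is only a sketch under strong assumptions: the reads are already aligned, there are no insertions or deletions, and the 20% threshold, the sequences, and the function names are all invented. Real variant callers also weigh base and mapping qualities.

```python
from collections import Counter
from typing import List, Tuple


def pileup_columns(reads: List[Tuple[int, str]], window_length: int) -> List[Counter]:
    """Count the bases observed at each reference position."""
    columns = [Counter() for _ in range(window_length)]
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if 0 <= position < window_length:
                columns[position][base] += 1
    return columns


def mixed_columns(reference: str, columns: List[Counter],
                  min_alt_fraction: float = 0.2) -> List[Tuple[int, str, str, float]]:
    """Report positions where a non-reference base reaches the given fraction of reads."""
    hits = []
    for position, counts in enumerate(columns):
        depth = sum(counts.values())
        alt_counts = {b: n for b, n in counts.items() if b != reference[position]}
        if depth == 0 or not alt_counts:
            continue
        alt_base, alt_n = max(alt_counts.items(), key=lambda item: item[1])
        fraction = alt_n / depth
        if fraction >= min_alt_fraction:
            hits.append((position, reference[position], alt_base, fraction))
    return hits


# Toy data: (start position, read sequence). Half the reads carry C at position 4.
reference = "GATTACAGGC"
reads = [
    (0, "GATTACAG"), (0, "GATTCCAG"),
    (2, "TTACAGGC"), (2, "TTCCAGGC"),
    (1, "ATTACAGG"), (1, "ATTCCAGG"),
]

print(mixed_columns(reference, pileup_columns(reads, len(reference))))
```

A non-reference fraction near one half is what a heterozygous site, one copy from each parent, is expected to look like.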
So, we found a mutation. And it's beautiful how you can see that 50% of them are blue and 50% of them are red. As you know, you receive your genetic information from your parents: 50% comes from your dad and 50% comes from your mom. So, most probably the C's come from one parent and the A's come from the other. So, we have the mutation. What do we do with it? We cannot do much without knowing where that mutation is located and what it does. So, the next step is to query different databases and see, for example: is this mutation located in a gene, or is it outside of a gene? We can look in the different databases available to see whether it is a frequent mutation in the population. If it's frequent, that tells us that most probably it is not disease causing. What happens in other species? Is this position conserved, a position that is highly conserved in evolution, so that it's protected to some extent? What do other algorithms or software tell us about it? Is it what we call a damaging mutation for the protein, one that is going to break the protein? Has it been seen before as a disease-causing mutation?
So, once you do that, you start having information, and you collect that for every single mutation you find in your sequences. It's easier said than done, though. Sequencing a genome is simple, but finding the disease-causing mutation is not so simple. I'm taking this from a Nature paper; this is the first case where they used whole-genome sequencing for clinical purposes. They sequenced fraternal twins, so not identical, with the same disease, the same disorder. They sequenced the genomes of both of them.
So, they started with 6 billion bases, and then they compared them. They found 1.6 million mutations common to both twins. Out of those, 9,000 were in protein-coding regions of genes. Out of those, only 4,000 changed the protein. The others were silent mutations: they wouldn't change the protein, so nothing would happen. Out of those, 77 were rare, uncommon in the population, so they were good candidates for being disease causing. Out of those, three looked like good candidates. And finally, following extra lab work, they figured out that only one of them was responsible for the disease that those two kids had. And most importantly, the kids hadn't been responsive to the initial treatment. Thanks to this clinical research, the treatment was changed to improve the outcome. And I think that's an important topic, right?
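The narrowing described above is a chain of filters applied one after another. The sketch below shows that shape on five made-up variants. It is not the pipeline from the paper: the field names, the 1% rarity cutoff, and the data are all assumptions, and the final steps in the talk (picking three candidates, then lab work) are manual and not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Variant:
    name: str
    shared_by_twins: bool
    in_coding_region: bool
    changes_protein: bool
    population_frequency: float


RARE_THRESHOLD = 0.01  # assumption: "rare" taken as under 1% of the population

# Filters in the order described in the talk.
FILTERS: List[Tuple[str, Callable[[Variant], bool]]] = [
    ("shared by both twins", lambda v: v.shared_by_twins),
    ("in a protein-coding region", lambda v: v.in_coding_region),
    ("changes the protein", lambda v: v.changes_protein),
    ("rare in the population", lambda v: v.population_frequency < RARE_THRESHOLD),
]


def run_funnel(variants: List[Variant]) -> Tuple[List[Variant], List[Tuple[str, int]]]:
    """Apply each filter in turn and record how many variants survive each step."""
    remaining = list(variants)
    counts = [("start", len(remaining))]
    for label, keep in FILTERS:
        remaining = [v for v in remaining if keep(v)]
        counts.append((label, len(remaining)))
    return remaining, counts


# Toy variants, invented for illustration.
toy_variants = [
    Variant("v1", True, True, True, 0.0001),
    Variant("v2", True, True, True, 0.20),     # too common to be a likely cause
    Variant("v3", True, True, False, 0.0005),  # silent
    Variant("v4", True, False, False, 0.001),  # outside coding regions
    Variant("v5", False, True, True, 0.0002),  # present in only one twin
]

candidates, step_counts = run_funnel(toy_variants)
for label, count in step_counts:
    print(f"{label}: {count}")
```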
So, this has been a hot topic recently, especially with Obama's precision medicine initiative. There's a clear goal now, and there is a lot of money to be able to use genomics to tailor medicine, right? To change treatments to improve the outcome. So, I showed you that we can use genomics to discover the reasons why we have a disease. We can do the diagnosis with that. We can even think about tumors. Not all tumors are identical, for example in lung cancer; different tumors are going to look different. So, we are able to sub-classify tumors. We can give you chances of survival. And as I said, with precision medicine, for example in the case of lung cancer, we can adapt the treatment.
As you know, when we apply a medicine to lung cancer, for example, not everybody shows the same reaction, right? Some patients respond to the medicine; others don't. But if we are able to know why they don't respond, we can change the treatment accordingly so they have better outcomes. So, this is one of the first cases where we used whole-exome sequencing clinically, and also to change the outcome. This was a case of six kids who didn't respond to the treatment. We found the responsible mutation in a gene called SLC26A3. And after that they were able to change the treatment and improve the outcome for the six kids.
A New Discovery: De Novo Mutations
So, I'm running out of time, but I'm going to give you a couple more examples. Not all the mutations that we have are inherited. We can get what we call de novo mutations. They appear in your system de novo, as new, so you didn't inherit them from your parents. In this case, we took trios: we took parents who were healthy and a kid who showed heart disease. And we looked for de novos in all those trios. We were able to find a set of genes that were the problem. But we also discovered another thing, or realized a problem that we have in our analysis pipelines. On average, each one of us has a de novo mutation in genes. Most of the time that mutation does nothing, except in some cases where it can cause serious problems for your health.
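The basic test for a new mutation in a parent-child trio (the healthy parents and affected kid described above) is whether the child carries an allele that neither parent has. Here is a minimal sketch of that check; the genotypes, site names, and function names are invented, and it assumes every genotype call is correct, which real data does not guarantee.

```python
from typing import Dict, List, Tuple

Genotype = Tuple[str, str]  # the two alleles observed at one site


def is_de_novo(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """True if the child carries an allele seen in neither parent."""
    parental_alleles = set(mother) | set(father)
    return any(allele not in parental_alleles for allele in child)


def find_de_novo(trio_calls: Dict[str, Dict[str, Genotype]]) -> List[str]:
    """Return the sites where the child's genotype cannot be explained by inheritance."""
    return [
        site
        for site, calls in trio_calls.items()
        if is_de_novo(calls["child"], calls["mother"], calls["father"])
    ]


# Toy genotype calls, invented for illustration.
trio_calls = {
    "site_1": {"child": ("A", "C"), "mother": ("A", "A"), "father": ("A", "A")},
    "site_2": {"child": ("G", "T"), "mother": ("G", "G"), "father": ("G", "T")},
    "site_3": {"child": ("C", "C"), "mother": ("C", "T"), "father": ("C", "C")},
}

print(find_de_novo(trio_calls))
```

Because the check depends on three genotype calls at once, a single miscalled base in any of the three samples is enough to produce a spurious hit.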
But in our analysis, out of those 362 kids, we found 2,000 de novos in genes. So, we clearly have a high ratio of false positives, even though on the graphic that I showed you before it looks like, oh, clearly, yeah, you have a mutation there. We found about six times more de novo mutations than expected. So, there are a lot of false positives, and we need to change our algorithms, or we have to work hard, to find the mutations actually responsible for congenital heart disease.
This is another case, one in which we didn't have families. The solution for finding the genes responsible for hypertension was to get big cohorts of individuals sharing the same problem, in this case hypertension, find healthy controls, and compare them. Just to give you an example, in your cases, in your patients, you may have two genes, A and B. Gene B is showing more mutations in your cases, and you may think, "Oh, gene B is probably the one responsible," if you try to do some statistical analysis there. But when you compare with controls, you realize that gene B already shows a lot of mutations in the normal population. It's gene A that is not tolerant to having mutations, and most probably it is the one responsible.
And just to finalize, since I'm running out of time: all these investigations, or most of the investigations that I just showed you, are funded by NIH through what we call the Centers for Mendelian Genomics. Yale is one of those four centers. We are trying to find the genes responsible for what we call rare Mendelian disorders. There are 6,000 rare Mendelian disorders out there. They're rare in the sense that only a few individuals are affected by each, but when you add them up, more than 25 million people (in the US alone) are affected by those diseases. So, like the Human Genome Project, all this is for a final goal, which is understanding human disease. Thank you very much. If you have questions, I'll take them now. Thank you.
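As a footnote to the gene A versus gene B comparison in the talk: one way to express "compare cases with controls" is a per-gene test on carrier counts. The talk does not say which statistic was used; the sketch below swaps in a one-sided Fisher's exact test, and the cohort sizes and counts are invented purely to reproduce the shape of the argument.

```python
from math import comb


def fisher_one_sided(case_carriers: int, case_total: int,
                     control_carriers: int, control_total: int) -> float:
    """Probability of seeing at least this many carriers among the cases by chance,
    given the total number of carriers (hypergeometric upper tail)."""
    carriers = case_carriers + control_carriers
    total = case_total + control_total
    denominator = comb(total, carriers)
    upper = min(carriers, case_total)
    numerator = sum(
        comb(case_total, k) * comb(control_total, carriers - k)
        for k in range(case_carriers, upper + 1)
    )
    return numerator / denominator


# Hypothetical counts: mutation carriers among 1,000 cases and 1,000 controls.
COHORT_SIZE = 1000
genes = {
    "gene_A": {"cases": 12, "controls": 1},   # few mutations, almost none in controls
    "gene_B": {"cases": 40, "controls": 38},  # many mutations, but just as many in controls
}

p_values = {
    name: fisher_one_sided(c["cases"], COHORT_SIZE, c["controls"], COHORT_SIZE)
    for name, c in genes.items()
}
for name, p in p_values.items():
    print(f"{name}: p = {p:.4f}")
```

With these toy counts, gene B has more mutations among the cases in absolute terms, yet only gene A stands out once the controls are taken into account.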