Bioinformatics and Virus Evolution

R. Dix, C. Mulvihill, D. Spencer, B. Williamson

BEDROCK Workshop, January 2005

 

HIV and HIV disease provide a rich context for student inquiry.  The topic offers everyday relevance and urgency; social, legal, and ethical dilemmas; and global implications, and it teaches about infectious disease, human biology, immunology, and evolution.  BEDROCK’s HIV problem space provides the tools for a rich inquiry experience.

 

 

 

Part I - Introduction to HIV and HIV Disease

 

Students first need to review facts about Human Immunodeficiency Virus (HIV) and the dynamics of HIV disease, which culminates in AIDS (Acquired Immune Deficiency Syndrome).  Go to: http://www.bioquest.org/bedrock/problem_spaces/hiv/index.php

Notice the space to your left has links to an introduction and background.  Study those two sections now and follow every link within those sections.  Take note of:

·  The virus life cycle and the body cells it targets

 

 

 

·  The role of the HIV Env protein

 

 

·  What is the V3 loop?

 

 

· Summarize how virus mutation contributes to the development of AIDS in the infected person.

 

 

 

Part II - HIV Forensics

 

HIV is known for its high mutation rate (and hence, accelerated evolution) because it is an RNA virus whose replication enzyme (reverse transcriptase) does not have the proof-reading, mistake-correcting mechanism typical of DNA replicating enzymes.  Indeed, one infected individual will harbor several versions of HIV which will change over time.  This is the basis for tracking the source of HIV infections. If two individuals have virus sequences that are very similar, a common source is a logical inference. 

We will use portions of the actual data from Florida court proceedings regarding HIV transmission in connection with a dental office.  These data were retrieved from the Microbes Count chapter “Molecular Forensics” by Sam Donovan.

 

One of the tools you will use for analyzing nucleic acid and protein sequence data is the Biology Workbench.  Go to www.bioquest.org/bedrock/problem_spaces/hiv/tools.php and click on “Orientation to the Biology Workbench.”  Open this file and read through it.

 

Let’s start working with Biology Workbench by copying the viral sequence below, taken from a blood sample of an infected patient, and using Biology Workbench to identify the gene this sequence belongs to.

 

>Unknown

GAGGTAGTAATTAGATCTGCCAATTTCACAGACAATGCTAAAATCATAATAGTACAGCTGAATGCATCTGTAGAAATTAATTGTACAAGACCCAACAACTATACAAGAAAAGGTATACGTATAGGACCAGGGAGAGCAGTTTATGCAGCAGAAAAAATAATAGGAGATATAAGACGAGCACATTGTAACATTAGTAGAGAAAAATGGAATAATACTTTAAAACAGGTAGTTACAAAATTAAGAGAACAATTTGTGAATAAAACAATAATCTTTACTCACCCCTCAGGAGGGGACCCAGAAAT

 

Directions:

 

· Log in to the Biology Workbench:  http://workbench.sdsc.edu

 

· If you do not have an account, you will need to get one by using the “Set up a free Account” link.

 

· Once you have an account and enter the site, scroll down to the bottom to access five buttons.  We are using nucleic acid data, so click on “nucleic tools.”

 

A.  Entering Sequences into the workspace

· Now you see a window with a list of tools.  Select “Add New nucleic sequence” and press the Run button.

 

· On the next page, enter the sequence into the “sequence” window and give it a title (use unknown) in the label window.  You can enter the sequence by typing it in, by using the browse button to select the file, or by using the Edit “paste” function after you Edit “copy” the sequence.

 

· Use the “Save” button on this page, which then takes you back to the previous page and now your sequence is in the Biology Workbench workspace.  Check the box next to the sequence called “unknown” and select the “View nucleic sequence” command and press Run to confirm that the sequence was correctly entered.*

*HINT:  press the “Return” button at the bottom of the Run page rather than using your browser’s “Back” button while working in this program.
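Stray spaces or line breaks inside a pasted sequence can cause trouble later in the workflow, so it can help to check a sequence before entering it.  The following is a minimal sketch in Python, not part of Biology Workbench; the function name clean_sequence and the short example text are made up for illustration, and the check assumes the sequence should contain only the letters A, C, G and T.

```python
def clean_sequence(raw: str) -> str:
    """Return the nucleotides of a pasted FASTA-style entry as one uppercase string."""
    lines = [line.strip() for line in raw.splitlines()]
    # Skip empty lines and the ">label" header line.
    body = [line for line in lines if line and not line.startswith(">")]
    # Join the lines and remove any whitespace left inside them.
    sequence = "".join("".join(line.split()) for line in body).upper()
    # Assumption: only unambiguous DNA letters are expected.
    unexpected = set(sequence) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return sequence


# Hypothetical pasted text with a blank line and a stray space.
example = """>Unknown

GAGGTAGTAATT AGATCTGCC
AATTTCACAGAC
"""
cleaned = clean_sequence(example)
print(cleaned)
print(len(cleaned), "nucleotides")
```

The cleaned string can then be pasted into the “sequence” window in one piece.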

 

B.  Identifying a sequence (is it from a known gene?)

· Return to your workspace and be sure your “unknown” sequence is check-marked; select BLASTN from the menu and press Run.  Now it asks you what database you want to search.  We know this is a viral sequence, so scroll down and select GenBank Viral Sequences.  Then scroll down and select “Submit.”

 

· The page that comes up is very long.  Look at the identity of the best (first) match by clicking on the score value to get to the alignment:  it tells you that this is: ____________________________________________________. 

 

You find out that it is an HIV sequence for sure, and then you find the notation V3 – if you did not know what V3 is in reference to HIV, you could go to www.google.com and put in “HIV and V3.”  You find out that V3 is:

 

C.  Viewing Multiple sequences

Now you can use the following data from the Florida dentist case.  In 1990 a young woman tested positive for HIV, but she had no known risk factors for getting the virus.  Investigation showed that she had had an invasive dental procedure done by an HIV-positive dentist years earlier.  It turned out that a few other patients of this dentist were also HIV-positive without identifiable risk factors.  The CDC (Centers for Disease Control) took blood samples and isolated V3 Env viral sequences from three HIV-positive patients (E, F and G), from the dentist, and from two HIV-positive individuals from the area who had no contact with the dentist.  The last two are controls: Local Control 3 and Local Control 22.

  

· You can now compare the sequences just by looking at them – working in your group, what sorts of patterns do you see within/between these sequences?

 

 

 

 

· How are these sequences similar or different?

 

 

 

 

 

D.  Using Multiple sequence alignment tools

To more fully answer the above questions, you can use a bioinformatics tool called multiple sequence alignment.  The ClustalW program will “align” sequences by finding the best ways to make the nucleotides in the sequences line up with one another.  To do this,

1. Use Biology Workbench to add these sequences to your workspace as you did above for the unknown.

 

>Dentist

GAGGTAGTAATTAGATCTGCCAATTTCACAGACAATGCTAAAATCATAATAGTACAGCTGAATGCATCTGTAGAAATTAATTGTACAAGACCCAACAACTATACAAGAAAAGGTATACGTATAGGACCAGGGAGAGCAGTTTATGCAGCAGAAAAAATAATAGGAGATATAAGACGAGCACATTGTAACATTAGTAGAGAAAAATGGAATAATACTTTAAAACAGGTAGTTACAAAATTAAGAGAACAATTTGTGAATAAAACAATAATCTTTACTCACCCCTCAGGAGGGGACCCAGAAAT

>PatientE

GAGATAGTAATTAAATCTGCCAATTTCACAGACAATGCTAAAATCATAATAGTACAGCTGAATGCATCTGTAGAAATTAATTGTACAAGACCCAACAACAATACAAGAAAAGGTATACATATAGGACCAGGGAGGGCATTTTATGCAACAGGAGAAATAATAGGAGATATAAGACAAGCACATTGTAACATTAGTGGAGAAAAATGGAATAATACTTTAAAACAGGTAGTTACAAAATTAAGAGAACAATTTGGGAATAAAACAATAATCTTTAATCACTCCTCAGGAGGGGACCCAGAAAT

>PatientF

GAAGTAGTAATTAGATCTGAAAATTTCACGGACAATGTTAAAACCATAATAGAGCAGCTGAATGAATCTGTACAAATTAATTGTACAAGACCCAACAACAATACAAGAAAAAGTATACATATAGCACCGGGGAGAGCATTTTATGCAACAGGAGAAATAATAAGAGATATAAGACAAGCACATCGTAACCTTAGTAGCATAAAATGGAATAACACTTTAAGACAGATAGCTAAAAAATTAAAAGAACAATTTGGAAATAAAACAATAATCTTTAATCAATCCTCAGGAGGGGACCCAGAAAT

>PatientG

GAGGTAGTAATTAGATCTGCCAATTTCACAGACAATGCTAAAATCATAATAGTACAGCTGAATGCACCTGTAGAAATTAATTGTACAAGACCCAACAACAATACAAGAAAAGGTATAAGTATAGGACCAGGGAGAGCATTTTATGCAACAGATAGAATAGTAGGAGATATAAGAAAAGCATATTGTAACATTAGTAGAGAAAAATGGAATAATACTTTAAAACTGGTAGTTACAAAATTAAGAGAACAATTTGTGAATAAAACAATAATCTTTAATCACTCCTCAGGAGGGGACCCAGAAAT

>LocalControl3

GAGGTAGTAATTAGATCTGAAAATTTCACGGACAATACTAAAACCATAATAGTACAGCTAAATACATCTGTAACAATTAATTGTACAAGACCTGGCAACAATACAAGAAAAAGTATAACTATGGGACCGGGGAAAGTATTTTATGCAGGAGAAATAATAGGAGATATAAGACAAGCACATTGTAACCTTAGTAGAACAGCATGGAATGACACTTTAGAACAGATAGTTGGAAAATTACAAGAACAATTTGGGAATAAAACAATAGTCTTTAATCACTCCTCAGGAGGGGACCCAGAAAT

>LocalControl22

GAGGTAGTAATTAGATCTGACAATTTCTCGGACAATGCTAGAACCATAATAGTACAGCTGAACGAATCTGTAGTAATTAATTGTACAAGACCCAACAACAATACGAGCAGACGTATAAGTATAGGACCAGGGAGAGCATTTACTGCAAGAGAAGGAATAATAGGAGACATAAGACAAGCACATTGTAACATTAGTGGAGCAGAATGGGAAAGCACTTTAAAACGGATAGTTGAAAAATTAGGAGAACAATTTAAGAATAAAACAATAGTCTTTAATCACTCCTCAGGAGGGGACCCAGAAAT

2. Once you have entered all these sequences into your workspace, check the boxes to select all the sequences, choose ClustalW, and press Run to perform a multiple sequence alignment.
*Hint:  If your ClustalW run in Biology Workbench gives a warning that your sequence has 0 bps, go back to your sequence, select “Edit nucleic sequence” and be sure you remove any blank space between the end of the first line and the beginning of the second line of the sequence.
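To get a feel for what “lining up” nucleotides means, here is a minimal Python sketch of a pairwise global alignment (the Needleman-Wunsch method).  This is not the ClustalW algorithm, which aligns many sequences at once; it only illustrates the idea on two short sequences.  The scoring values (match +1, mismatch -1, gap -2) and the toy sequences are assumptions chosen for illustration.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Align two sequences end to end; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Toy sequences (made up): the second one is missing a single nucleotide.
top, bottom, total = global_align("ACGTACGT", "ACGACGT")
print(top)
print(bottom)
print("score:", total)
```

A dash in the output marks a gap, the same symbol you will see in the ClustalW alignment.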

 

· Does the information in the multiple sequence alignment verify patterns you saw by looking at the raw sequences?  Summarize what the alignment shows in your own words:

 

 

 

 

 

 

· Does the multiple sequence alignment show you additional patterns?  Elaborate:

 

 

 

 

 

 

·  Note that local control 3 had a gap of three nucleotides in the sequence – what would this represent at the protein level?
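One way to explore this question is to translate a short stretch of DNA with and without a three-nucleotide gap and compare the resulting proteins.  The Python sketch below uses the standard genetic code; the toy sequence and the function name translate are made up for illustration, and the reading frame of the real V3 data is not given here, so treat this only as an experiment on invented input.

```python
from itertools import product

# Standard genetic code, listed in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate DNA in the first reading frame, ignoring leftover nucleotides."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Toy example (hypothetical sequence, assumed to start in frame).
full = "ATGGCCAAAGGT"
minus_three = "ATGAAAGGT"   # the three nucleotides GCC removed
minus_one = "ATGCCAAAGGT"   # a single nucleotide removed instead

print(translate(full))
print(translate(minus_three))
print(translate(minus_one))
```

Compare the three outputs and decide which kind of gap changes the protein the least.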

 

Looking through six aligned sequences is time-consuming.  Another way to compare sequences and look at their relatedness is to use the pairwise alignment scores.  Each % identity score takes two sequences, counts the number of identical positions, and divides by the total number of positions.  Look at these values and list the three highest identity percentages and the sequence pairs they belong to:
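The % identity calculation can be sketched in a few lines of Python.  This is an illustration of the arithmetic described above, not the exact procedure used by Biology Workbench; in particular, how gaps are counted is an assumption here (a column with a gap in one sequence counts as a non-identical position, and a column with gaps in both sequences is skipped).  The toy sequences are made up.

```python
from itertools import combinations


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of compared alignment columns where both sequences have the same letter."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: skip columns where both sequences have a gap.
    columns = [
        (x, y)
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if not (x == "-" and y == "-")
    ]
    if not columns:
        raise ValueError("no comparable positions")
    identical = sum(1 for x, y in columns if x == y)
    return 100.0 * identical / len(columns)


# Toy aligned sequences (hypothetical, not the court data).
toy = {"seqA": "ACGTACGT", "seqB": "ACGTACGA", "seqC": "AC-TTCGA"}
for name1, name2 in combinations(toy, 2):
    print(name1, name2, round(percent_identity(toy[name1], toy[name2]), 1))
```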

 

 

 

Another way of comparing sequences is a distance tree.  The distance tree uses the sequence data to estimate genetic distances between sequences and draws them as branch lengths between the tips.  Note that you do NOT expect that HIV acquired from one person would be identical to the HIV in the recipient, due to the rapid mutation rate of HIV.  However, you do assume that the more similar viral sequences are, the more likely they come from a recent common ancestor virus.  Does the distance tree imply directionality?
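The idea of grouping sequences by distance can be sketched as follows.  This Python example computes the fraction of differing positions between aligned sequences and then repeatedly joins the two closest groups (the UPGMA method).  UPGMA is used here only because it is short to write; the tree-building method used by Biology Workbench is not stated in this handout and may differ.  The sequences and names are made up, and the output shows only the grouping, not branch lengths.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs: dict) -> str:
    """Return the grouping of the sequences as a nested, Newick-style string."""
    sizes = {name: 1 for name in seqs}  # cluster label -> number of sequences in it
    dist = {
        frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }
    while len(sizes) > 1:
        closest = min(dist, key=dist.get)  # the two clusters with the smallest distance
        a, b = sorted(closest)
        merged = f"({a},{b})"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        del dist[closest]
        for other in sizes:
            d_a = dist.pop(frozenset((a, other)))
            d_b = dist.pop(frozenset((b, other)))
            # Average distance, weighted by how many sequences each cluster holds.
            dist[frozenset((merged, other))] = (size_a * d_a + size_b * d_b) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes)) + ";"


# Toy aligned sequences (hypothetical, not the court data).
toy = {"A": "AAAAAAAA", "B": "AAAAAAAT", "C": "AAAATTTT", "D": "TTTTTTTT"}
print(upgma(toy))
```

Sequences that share the innermost parentheses are the most similar pair in the toy data.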

 

Looking at the distance tree and the pairwise similarity scores, summarize an argument to present to a judge and jury regarding the claims from patients E, F, and G that they acquired HIV from the dentist.  Be sure to include data from the local controls in your argument.

 

 

 

 

 

 

 

 

 

 

 

 

Part III - Student Exploration of HIV Evolution

 

Now for the creative part – you, the student, will use a set of data taken from a study of HIV V3 sequences over time from a group of IV-drug users in Baltimore, and you will also have CD4 counts to go with the viral sequence data.    The data came from a paper by R.B. Markham and colleagues, published in PNAS 95: 12568, 1998.  The data has been trimmed from the original set to provide first and last visits from subjects that started with normal CD4 counts and ended with below-normal CD4 counts where sequences were available.  Start with the table below to look over the data.  An expanded data set can be found at http://bioquest.org/bedrock/problem_spaces/hiv/sequence_data.php

 

There are many questions that could be addressed with this data set, and it is up to you to formulate a hypothesis and use the data to test that hypothesis, using the tools of Biology Workbench.

 

Happy prospecting!

 

 

 

 

 

 

Data Summary

Number of subjects:  4 of 15

Visits per subject:  only first and last shown.

 

Subject   Visit   Visit Date   CD4 count   # sequences
1         1       2/9/89       464         11
1         5       9/2/92       14          13
3         1       2/21/91      819         4
3         6       10/27/93     47          6
10        1       5/21/91      833         7
10        6       11/9/93      17          10
15        1       6/5/89       707         12
15        4       4/11/91      10          10
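As a starting point for forming a hypothesis, the summary data can be turned into simple numbers such as the time between visits and the drop in CD4 count.  The Python sketch below assumes the Data Summary is read as one first visit and one last visit per subject, with dates in month/day/year order and two-digit years in the 1900s; the function and variable names are made up for illustration.

```python
from datetime import datetime

# (subject, first visit date, last visit date, first CD4 count, last CD4 count),
# transcribed from the Data Summary under the reading described above.
VISITS = [
    (1, "2/9/89", "9/2/92", 464, 14),
    (3, "2/21/91", "10/27/93", 819, 47),
    (10, "5/21/91", "11/9/93", 833, 17),
    (15, "6/5/89", "4/11/91", 707, 10),
]


def summarize(row):
    """Return days between visits and CD4 decline for one subject."""
    subject, first, last, cd4_first, cd4_last = row
    start = datetime.strptime(first, "%m/%d/%y")
    end = datetime.strptime(last, "%m/%d/%y")
    days = (end - start).days
    drop = cd4_first - cd4_last
    return {
        "subject": subject,
        "days": days,
        "cd4_drop": drop,
        "drop_per_year": drop / (days / 365.25),
    }


summaries = [summarize(row) for row in VISITS]
for s in summaries:
    print(f"subject {s['subject']}: {s['days']} days, "
          f"CD4 drop {s['cd4_drop']}, about {s['drop_per_year']:.0f} per year")
```

These numbers could be set beside sequence comparisons from Biology Workbench, for example to ask whether subjects with faster CD4 decline show more sequence change between visits.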