Computer Workshop

For this workshop, you will need internet access and two free programs that can be downloaded online. Make sure you can open the programs before you start. You will be using a website run by the National Center for Biotechnology Information (NCBI), which is used extensively by biologists.  Tools on this site include PubMed, which searches the scientific literature for papers, and BLAST, which finds similarity between an entered nucleotide or protein sequence and the sequences in the databases.  All sequences that biologists publish are added to the databases.  These may come from individual research labs or from genome projects. For this activity, you will need the following programs:

            •           SeaView (http://doua.prabi.fr/software/seaview)

            •           PAML-X (http://abacus.gene.ucl.ac.uk/software/paml.html)

For this activity you will be assigned a protein and go through a pipeline to determine the function of the gene, the evolutionary history of the gene, and whether there are locations in the gene that are undergoing adaptive evolution or purifying evolution.

            •           There are several genes available for you to choose from at the front of the class. Pick one! Copy the sequence here:

            •           First, let’s figure out what your gene is. Go to:  http://www.ncbi.nlm.nih.gov/BLAST/.  Under Web BLAST, click on Nucleotide BLAST (BLASTn). Type or paste your gene sequence into the Enter Query Sequence box (all of the sequences are on BlackBoard so you can copy it from there). Click the blue BLAST button at the bottom of the page.  In a few seconds to a few minutes your search results will be displayed.  At the top of the page there is a summary of conserved domains, if present, and a graphical representation of similar sequences; in the middle of the page is the list of similar sequences; and at the bottom of the page are the actual sequence alignments.  The E value on the right of the list is the number of matches of that quality expected by chance alone, so a “good” match has a very low E value (< 1e-4).

            •           What protein is your gene likely to encode?  

            •           What species did your gene likely come from?

            •           What is the function of your gene? (You may need to do some extra Googling)

            •           Copy the CODING REGION ONLY for your gene into a notepad (or equivalent text editing) file. DO NOT USE WORD!! You will be saving this file as a FASTA file which is used to organize genetic sequences. For an example of FASTA format, see https://zhanggroup.org/FASTA/#:~:text=A%20sequence%20in%20FASTA%20format,than%2080%20characters%20in%20length.

In general, FASTA format is:

>name1

Sequence1

>name2

Sequence2

            •           Return to your BLASTn results. Include the sequences that are the most similar matches to your sequence in your FASTA file using the appropriate formatting. Make sure the names you pick for the sequences are <8 characters and have no spaces. In order to find the nucleotide sequences from the results, click on the GenBank link. The following page will have a link to the FASTA sequence. Make sure to not duplicate any species. Save at least 5 separate species/sequences in your FASTA file, though keep in mind that 10 species will provide more robust results. Make sure to save your file with the extension *.fst. Copy your FASTA file contents below:

Of special note: make sure that you are only saving the coding region of the sequence! This means it should start with a start codon and end with a stop codon. The total length should also be a multiple of three!!
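
The checks described above (short names without spaces, a start codon, a stop codon, and a length that is a multiple of three) can be sketched in Python. This is only an illustrative sketch: the function names and the example sequences are hypothetical, and it assumes the standard genetic code with ATG as the start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumes the standard genetic code


def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def check_coding_region(name, seq):
    """Return a list of problems found (an empty list means the record passed)."""
    problems = []
    if len(name) >= 8 or " " in name:
        problems.append("name must be under 8 characters with no spaces")
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of three")
    if not seq.startswith("ATG"):
        problems.append("does not begin with the start codon ATG")
    if seq[-3:] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    return problems


# Hypothetical toy records: the first passes, the second is truncated.
example = """>human
ATGGCCTTTTAA
>mouse
ATGGCCTTTT
"""

for name, seq in read_fasta(example).items():
    print(name, check_coding_region(name, seq) or "OK")
```

To check your own file, read its contents into a string and pass that string to read_fasta in place of the toy example.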

            •           Now that you have homologous sequences from multiple species, you can begin to align the sequences and construct a phylogenetic gene tree. To do this, you will use SeaView.

            •           Import your sequences into SeaView

            •           File->Open FASTA->filename.fst

            •           Before you align your nucleotide sequence, you will need to convert your sequences into amino acids within the program. This allows the alignment to properly compare equivalent portions of the gene to one another across species and prevents problems with gaps. To do this, click the tab “Props” and check “view as proteins”

            •           Click Align->Align all

            •           Wait until the process finishes

            •           Click Okay, observe alignment

            •           Save the alignment as a new *.fst file

            •           Now that you have the alignment file, you can use it to construct a phylogenetic tree. There are four main ways to construct trees:

            •           Parsimony

            •           Assumes least amount of evolutionary change to make tree (Occam’s razor)

            •           Distance methods

            •           Measures “genetic distance” between sequences, ones closest together are grouped in phylogenetic tree

            •           Maximum likelihood (PhyML)

            •           Like parsimony, but implements statistics based on prevalence of character states. Is able to estimate rate of sequence evolution. Is robust and favored in analyses

            •           Bayesian

            •           Incorporates Bayesian statistics, relies on prior knowledge and updates as more knowledge is gained through replication of data analysis (Monty Hall Problem). This method is very robust, but takes a day or two to run. We will not be doing this one for the workshop (but you are welcome to try at home!)
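
To make the “genetic distance” idea behind distance methods concrete, here is a minimal Python sketch of the simplest such measure, the p-distance (the proportion of aligned sites that differ). It is an illustration only, not necessarily the distance SeaView computes, and the function name and sequences are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites that differ, ignoring sites with a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # a gap in either sequence carries no comparison
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


# Hypothetical aligned sequences of equal length.
aligned = {
    "human": "ATGGCCTTTAAA",
    "chimp": "ATGGCCTTTAAG",
    "mouse": "ATGGCATTCAAG",
}

names = list(aligned)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        distance = p_distance(aligned[first], aligned[second])
        print(first, second, round(distance, 3))
```

In this toy data the pair with the smallest distance (human and chimp) would be grouped together first.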

            •           The tree you want for your analyses today uses the PhyML methodology. Do not do bootstraps (repeated analyses) for the sake of time. Under the default settings (GTR model), branch support is instead estimated with an approximate likelihood-ratio test (aLRT), which takes a fraction of the time.

            •           Based on the model you are using, construct your tree by clicking:

            •           Trees->method of choice->Run/Okay/Go

            •           Wait until the analysis finishes

            •           Click Okay, observe phylogeny

            •           There are multiple ways to view each tree. All methods (except parsimony) will allow you to see overall evolutionary divergence by looking at the length of branches with “squared” view. “Circular” view allows for a different representation, though includes the same information. “Cladogram” merely presents the branching patterns with all ends lining up. Play with the options a little. Which version do you like best?

            •           You can save the phylogeny as a PDF as a visual for reporting data. Also save as an unrooted tree as “name.txt”. This will save the tree in what is called “Free Newick” format, which is a way to write out what the tree looks like using nested sets of names. Make sure to turn in your image of your tree for this assignment.

            •           Before you can measure selection pressures on your gene family, you will need to edit the text file you just created (name.txt). The file you saved indicates the relatedness of the sequences to each other, but also the distance between them. The program you will be using does not recognize the numbers and therefore will not work with your raw text file. Open the tree text file and delete all of the numbers and colons. Leave all commas, parentheses, and semi-colons!!
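
If you would rather not edit the tree by hand, the same clean-up can be sketched with a regular expression in Python. This is a sketch under assumptions: the function name and the example tree are hypothetical, and it assumes branch lengths always follow a colon and support values sit directly after a closing parenthesis.

```python
import re

# A number such as 0.012, 12, or 1e-05.
NUMBER = r"[0-9.]+(?:[eE][-+]?[0-9]+)?"


def strip_branch_values(newick):
    """Remove branch lengths and support values from a Newick string."""
    # Drop every colon together with the (possibly negative) number after it.
    no_lengths = re.sub(r":-?" + NUMBER, "", newick)
    # Drop support values written directly after a closing parenthesis.
    return re.sub(r"\)" + NUMBER, ")", no_lengths)


# Hypothetical tree with branch lengths and one support value (0.98).
tree = "((human:0.012,chimp:0.015)0.98:0.050,mouse:0.210,rat2:0.195);"
print(strip_branch_values(tree))
```

Note that the digit in the name rat2 survives, because only numbers after a colon or a closing parenthesis are removed. Keep this in mind when editing by hand as well: digits that are part of a sequence name must stay.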

            •           To determine selection pressures, we use a value called dN/dS. dN is the number of non-synonymous substitutions per non-synonymous site and dS is the number of synonymous substitutions per synonymous site, each estimated between two sequences. A ratio above 1 indicates adaptive (positive) selection, a ratio below 1 indicates purifying selection, and a ratio near 1 is consistent with neutral evolution.

            •           To test for positive selection, we will use the program PAML-X. Open PAML-X.

            •           Click on “YN00.” This part of the program conducts pairwise comparisons between each of the sequences you input.

            •           In PAML-X, input your *.fst file into seqfile. Indicate an output location. Most of the defaults are fine; however, I usually uncheck “commonkappa” and instead check “commonf3x4”

            •           Run the analysis

            •           This analysis will provide you with three different methods of calculating dN/dS. The middle method, by Yang, is considered the most robust. The numbers provided are pairwise and averaged over the entire gene sequence. You will not see very high numbers because regions in a gene evolve at different rates and experience different pressures. We will want to find a more representative measure of dN/dS. What values of dN/dS did you calculate from this step?

            •           In your output file, find where your sequence of interest is compared to each of the other sequences in your FASTA file. Make a file in a spreadsheet program (Excel or Google Sheets) and use three columns: 1) species/sequence you are comparing it to 2) the dN value 3) the dS value.

            •           Plot the dN and dS values against each other with the dN values on the x-axis and dS values on the y-axis. The slope should be entirely positive. If you have a lot of distantly related species, then the slope will flatten out. This is called dS saturation. Make a new FASTA file with only the sequences that are part of the positive slope. These will probably only be very closely related species.
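
If it is hard to judge by eye where the plot flattens, one rough aid is to sort the comparisons by dN and look at the slope between consecutive points. The Python sketch below does this with hypothetical values and a hypothetical function name; it does not apply any cutoff, so deciding which sequences to keep is still up to you.

```python
# (compared species, dN, dS): hypothetical values for illustration only
rows = [
    ("chimp", 0.005, 0.020),
    ("macaque", 0.015, 0.060),
    ("mouse", 0.060, 0.240),
    ("chicken", 0.150, 0.300),
]


def local_slopes(rows):
    """Sort by dN and return (from_name, to_name, slope) for consecutive points."""
    ordered = sorted(rows, key=lambda row: row[1])
    slopes = []
    for (name1, dn1, ds1), (name2, dn2, ds2) in zip(ordered, ordered[1:]):
        if dn2 == dn1:
            continue  # avoid dividing by zero when two comparisons share a dN
        slopes.append((name1, name2, (ds2 - ds1) / (dn2 - dn1)))
    return slopes


for start, end, slope in local_slopes(rows):
    print(f"{start} -> {end}: slope = {slope:.2f}")
```

In this toy data the slope stays steep up to mouse and then drops sharply for chicken, which is the flattening pattern described above.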

            •           In PAML-X, click on CODEML. This part of the program will estimate the dN/dS ratio (omega) for every codon. Input your tree and seq file into CODEML. Choose seqtype as codons. Change codon frequency to option 2 (F3x4) and uncheck getSE. For NSsites, check only 8 and uncheck 0. You will need to do this analysis twice: in one run, fix omega at 1; in the other, leave omega unfixed so that it is estimated. Run the analysis. Each of these outputs will give you a likelihood number. You need to determine if the model (omega estimated) is significantly different from the null model (omega fixed). To do this, use the equation:

2*(ln(likelihood of model)-ln(likelihood of null))

A value of 2.71 or above is significant at the 5% level

A value of 5.41 or above is significant at the 1% level

Was your result significant?
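
The comparison can also be done with a few lines of Python. This is a minimal sketch with hypothetical likelihood values and function names; it assumes the lnL numbers are copied from the two CODEML output files, where they are already log-likelihoods, so no further logarithm is taken. The cutoffs are the two values given above.

```python
def likelihood_ratio_test(lnl_model, lnl_null):
    """Return the test statistic 2 * (lnL of model - lnL of null)."""
    return 2.0 * (lnl_model - lnl_null)


def interpret(statistic):
    """Compare the statistic with the workshop's cutoffs (2.71 and 5.41)."""
    if statistic >= 5.41:
        return "significant at the 1% level"
    if statistic >= 2.71:
        return "significant at the 5% level"
    return "not significant"


# Hypothetical log-likelihoods; replace them with the values from your runs.
lnl_null = -2345.67   # omega fixed at 1
lnl_model = -2342.10  # omega estimated

statistic = likelihood_ratio_test(lnl_model, lnl_null)
print(round(statistic, 2), interpret(statistic))
```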

            •           The output files also list the specific codon positions that are under positive selection. This is done using two different methods. The second one (Bayes Empirical Bayes) is the one you should look at when determining whether there is significance at the level of individual sites. Are any sites undergoing positive/adaptive evolution?
