Commit 7d62705b authored by Mika Cankosyan

README.txt

parent 15b2425c
Table of contents
Introduction
Program description
How to run
Introduction
The Nussinov algorithm is one of the earliest well-known algorithms for RNA secondary structure prediction. Given an RNA primary sequence, it finds
the secondary structure with the maximum number of complementary base-pairs and without any pseudoknots. It is a very simple algorithm that uses
dynamic programming. It works best for short sequences whose structures have no pseudoknots; it struggles with longer sequences and cannot
predict pseudoknots at all. It is deterministic, meaning it will output the same secondary structure every time for a given sequence. This is good
for consistency and ease of debugging, but it means that it can't give probabilities for multiple different structures, even though some RNA
molecules can take on multiple different secondary structures. Since it is a simple algorithm that only "cares" about the number of complementary
base-pairs, it doesn't account for other biological considerations such as thermodynamic stability.
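To make the dynamic programming idea concrete, here is a minimal sketch of the classic table fill. It is not taken from nussinov.py: the names can_pair and max_pairs are hypothetical, and it assumes that only A-U and G-C pair and that no minimum loop length is enforced.

```python
def can_pair(a, b):
    """Watson-Crick complementarity only (assumption): A-U and G-C."""
    return {a, b} in ({"A", "U"}, {"G", "C"})


def max_pairs(seq):
    """Return the maximum number of nested complementary base-pairs in seq."""
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j] (inclusive)
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):                # distance between i and j
        for i in range(n - span):
            j = i + span
            # Option 1: leave position i or position j unpaired
            best = max(dp[i + 1][j], dp[i][j - 1])
            # Option 2: pair i with j around the best inner structure
            if can_pair(seq[i], seq[j]):
                inner = dp[i + 1][j - 1] if span > 1 else 0
                best = max(best, inner + 1)
            # Option 3: bifurcation, i.e. two independent substructures
            for k in range(i + 1, j):
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    print(max_pairs("GGGAAAUCC"))
```

Because every cell only combines smaller subsequences, the pairs it counts are always nested, which is why pseudoknots cannot appear in the result.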
Program description
This program implements the original Nussinov algorithm as well as optimized and probabilistic forms of it. The original version uses
dynamic programming. In a nutshell, given an RNA primary sequence, it constructs a dynamic programming table and a backtrace table, and then
uses the backtrace table to recover the optimal secondary structure (i.e. the one with the maximum number of base-pairs). The optimized version,
instead of just giving a score of 1 for complementary (i.e. A-U and G-C) base-pairs, assigns custom scores to those as well as to one more possible
base-pair, G-U. By default, A-U is given a score of 1, G-C 1.5, and G-U 0.5, but these can be adjusted to what works best. The probabilistic
version, instead of storing a single choice in each backtrace entry to generate only the one optimal path, stores multiple paths in each entry and
backtraces probabilistically, so the output is a secondary structure chosen at random, with higher-scoring paths weighted more heavily.
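The following sketch illustrates the two variants described above. It is an illustration under assumptions, not the code in nussinov.py: PAIR_SCORES, pair_score, best_score and choose_path are hypothetical names, and the path labels in the usage lines are made up. Only the default scores and the (score + 1) ** 3 weighting come from this repository.

```python
import random

# Default scores quoted above; G-U is the extra pair allowed by the optimized version.
PAIR_SCORES = {
    frozenset("AU"): 1.0,
    frozenset("GC"): 1.5,
    frozenset("GU"): 0.5,
}


def pair_score(a, b):
    """Score for pairing bases a and b; 0.0 means they cannot pair."""
    return PAIR_SCORES.get(frozenset((a, b)), 0.0)


def best_score(seq):
    """Fill a Nussinov-style table using weighted pair scores."""
    n = len(seq)
    if n == 0:
        return 0.0
    dp = [[0.0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])      # i or j unpaired
            score = pair_score(seq[i], seq[j])
            if score > 0:
                inner = dp[i + 1][j - 1] if span > 1 else 0.0
                best = max(best, inner + score)         # i pairs with j
            for k in range(i + 1, j):                   # bifurcation
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]


def scores_to_weights(scores):
    # Same formula as scores_to_weights in nussinov.py: cubing favors good
    # scores, and the + 1 keeps every weight above zero.
    return [(score + 1) ** 3 for score in scores]


def choose_path(candidates, rng=random):
    """Pick one path from a list of (path, score) tuples, weighted by score."""
    paths = [path for path, _ in candidates]
    weights = scores_to_weights([score for _, score in candidates])
    return rng.choices(paths, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(best_score("GGGAAAUCC"))
    # Hypothetical backtrace entry: two ways to reach the same cell.
    print(choose_path([("pair i-j", 2.5), ("skip i", 1.0)]))
```

With these weights a path scoring 1 is eight times as likely to be picked as a path scoring 0, so strong candidates dominate without excluding the weaker ones entirely.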
How to run
Start the program with a Python interpreter, e.g. "python3 nussinov.py". You will be asked to enter the RNA primary sequence and whether you want
to use the optimized and/or probabilistic version. The program then displays its output, including the dynamic programming table and the optimal or
probabilistically chosen secondary structure, shown in two different ways (hyphen-parentheses notation and a list of base-pairs). If you enter
something invalid, it will ask you to run the program and try again.
See the comments in nussinov.py for more details.
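To show how the two output forms relate, here is a small sketch that turns a list of base-pairs into hyphen-parentheses notation. The helper pairs_to_notation is hypothetical and is not the bt_to_coded_list function in nussinov.py; it assumes the base-pairs are given as a dict of 0-based positions, one key-value pair per base-pair.

```python
def pairs_to_notation(length, base_pairs):
    """Render base-pairs as a string of '(', ')' and '-' characters."""
    coded = ["-"] * length                  # '-' marks an unpaired base
    for i, j in base_pairs.items():
        left, right = min(i, j), max(i, j)
        coded[left] = "("                   # opening partner of the pair
        coded[right] = ")"                  # closing partner of the pair
    return "".join(coded)


if __name__ == "__main__":
    # Positions 0-8, 1-7 and 5-6 paired in a sequence of length 9
    print(pairs_to_notation(9, {0: 8, 1: 7, 5: 6}))
```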
\ No newline at end of file
@@ -24,8 +24,8 @@ def exit_with_error(error):
# just give u all the tied best scores, whereas running with "optimal" always chooses the same one of the tied-best scores/structures)
def scores_to_weights(scores):
return [(score + 1) ** 3 for score in scores]
-# cubed to make good scores especially likely, + 1 bc we can't have all weights = 0, and the + 1 before cubing so that
-# values 0 < v < 1 are still appropriately scaled up
+# cubed to make good scores especially likely, + 1 bc we can't have all weights = 0 and so that
+# scores 0 < v < 1 are still appropriately scaled up
def is_valid_rna_sequence(rna_sequence):
valid_chars = {'A', 'U', 'G', 'C'}
@@ -56,7 +56,6 @@ def probabilistic_bt_to_chosen_bt(bt, i, j):
length = j - i + 1
while (bt[i][j] != []):
-# print(i, j, bt[i][j]) # debugging
paths = [path for path, _ in bt[i][j]]
scores = [score for _, score in bt[i][j]]
weights1 = scores_to_weights(scores)
@@ -71,7 +70,6 @@ def probabilistic_bt_to_chosen_bt(bt, i, j):
i += 1
j -= 1
else: # bifurcation
-# print(bt[i][j]) # debugging
probabilistic_bt_to_chosen_bt(bt, i, bt[i][j])
probabilistic_bt_to_chosen_bt(bt, bt[i][j] + 1, j)
return bt
@@ -80,7 +78,6 @@ def probabilistic_bt_to_chosen_bt(bt, i, j):
# returns a list consisting of '(', ')', and '-' characters
def bt_to_coded_list(bt, i, j):
-# print(i, j) # debugging
length = j - i + 1
og_i = i
coded_list = list('-' * length)
@@ -251,15 +248,13 @@ dp_table, bt = nussinov(rna_sequence, optimized, probabilistic)
print("\n\nthe dynamic programming table:\n")
print_2d_array(dp_table)
-# print(bt) # debugging
if (probabilistic):
bt = probabilistic_bt_to_chosen_bt(bt, 0, len(rna_sequence) - 1)
-# print(bt) # debugging
print("\n\nthe probabilistically chosen secondary structure, in hyphen-parentheses notation, and as a list of base-pairs", \
"each key-value pair is a base-pair)\n")
else:
-print("\n\nthe optimal secondary structure, in hyphen-parentheses notation, and as a list of base-pairs (each key-value pair is a base-pair)\n")
+print("\n\nthe optimal secondary structure, in hyphen-parentheses notation, and as a list of base-pairs (each key-value pair is a base-pair):\n")
coded_list = bt_to_coded_list(bt, 0, len(rna_sequence) - 1)
base_pairs = bt_to_base_pairs(bt, 0, len(rna_sequence) - 1, {})