README.txt

Commit 7d62705b authored 2 months ago by Mika Cankosyan
--- a/README.txt
+++ b/README.txt
+Table of contents
+
+Introduction
+Program description
+How to run
+
+Introduction
+
+The Nussinov algorithm is one of the earliest well-known algorithms for RNA secondary structure prediction. Given an RNA primary sequence, it finds 
+the secondary structure with the maximum number of complementary base-pairs and without any pseudoknots. It is a very simple algorithm that uses
+dynamic programming. It works best for short sequences whose structures contain no pseudoknots; it struggles with longer sequences and cannot predict pseudoknots.
+It is deterministic, meaning it will output the same secondary structure every time. This is good for consistency and ease of debugging, but 
+it means that it can't give probabilities for multiple different structures, even though some RNA molecules can take on multiple different
+secondary structures. Since it is a simple algorithm that only "cares" about the number of complementary base-pairs, it doesn't account for other
+biological considerations such as thermodynamic stability.
+
+Program description
+
+This program implements the original Nussinov algorithm as well as optimized and probabilistic variants of it. The original version uses
+dynamic programming. In a nutshell, given an RNA primary sequence, it constructs a dynamic programming table and a backtrace table, and then
+uses the backtrace table to trace back the optimal secondary structure (i.e. the one with the maximum number of base-pairs). The optimized version,
+instead of just giving a score of 1 to each complementary (i.e. A-U and G-C) base-pair, assigns custom scores to those as well as to another possible
+base-pair, G-U. By default, A-U is given a score of 1, G-C 1.5, and G-U 0.5, but these can be adjusted to what works best. The probabilistic
+version, instead of storing a single choice in each backtrace entry to generate only the one optimal path, stores multiple paths in each entry and
+backtraces probabilistically, so that it outputs a secondary structure chosen at random, weighted by score.
+
+How to run
+
+Start the program with a Python interpreter, e.g. "python3 nussinov.py". You will be asked to enter the RNA primary sequence and whether you want
+to use the optimized and/or probabilistic version. The program then displays output including the dynamic programming table and the optimal or
+probabilistically chosen secondary structure, shown both in hyphen-parentheses notation and as a list of base-pairs. If you enter something
+invalid, it will ask you to run the program again.
+
+See the comments in nussinov.py for more details.
\ No newline at end of file
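The table-filling and backtrace steps described in the README's "Program description" can be sketched as follows. This is a minimal illustration, not the code in nussinov.py: the names `nussinov_sketch` and `pair_scores` are hypothetical, it assumes no minimum hairpin loop length, and it recomputes the backtrace from the score table instead of storing a separate backtrace table. Passing a custom `pair_scores` dictionary mimics the "optimized" scoring the README mentions.

```python
def nussinov_sketch(seq, pair_scores=None):
    """Return (best_score, structure); structure uses '(', ')' and '-'."""
    if pair_scores is None:
        # Original scoring: 1 point per complementary base-pair.
        pair_scores = {("A", "U"): 1, ("G", "C"): 1}

    def score(a, b):
        # Pairs are unordered, so look up both orientations.
        return pair_scores.get((a, b), pair_scores.get((b, a), 0))

    n = len(seq)
    if n == 0:
        return 0, ""

    # dp[i][j] = best score for the subsequence seq[i..j] (inclusive).
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])  # leave i or j unpaired
            s = score(seq[i], seq[j])
            if s > 0:
                best = max(best, dp[i + 1][j - 1] + s)  # pair i with j
            for k in range(i + 1, j):  # bifurcation into two substructures
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best

    # Backtrace: re-derive which case produced each dp entry.
    structure = ["-"] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        s = score(seq[i], seq[j])
        if dp[i][j] == dp[i + 1][j]:
            stack.append((i + 1, j))
        elif dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
        elif s > 0 and dp[i][j] == dp[i + 1][j - 1] + s:
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if dp[i][j] == dp[i][k] + dp[k + 1][j]:
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    return dp[0][n - 1], "".join(structure)


print(nussinov_sketch("GGGAAACCC"))  # (3, '(((---)))')

# The README's default "optimized" scores, passed in explicitly.
weighted = {("A", "U"): 1, ("G", "C"): 1.5, ("G", "U"): 0.5}
print(nussinov_sketch("GU", weighted))  # (0.5, '()')
```

When several structures tie for the best score, this sketch returns whichever one its case ordering reaches first; the actual program may break ties differently.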
--- a/nussinov.py
+++ b/nussinov.py
@@ -24,8 +24,8 @@ def exit_with_error(error):
 # just give u all the tied best scores, whereas running with "optimal" always chooses the same one of the tied-best scores/structures)
 def scores_to_weights(scores):
    return [(score + 1) ** 3 for score in scores]
-    # cubed to make good scores especially likely, + 1 bc we can't have all weights = 0, and the + 1 before cubing so that 
-    # values 0 < v < 1 are still appropriately scaled up
+    # cubing makes high scores especially likely; the + 1 keeps the weights from all being 0 and ensures
+    # that scores 0 < v < 1 are scaled up by the cube rather than shrunk

 def is_valid_rna_sequence(rna_sequence):
    valid_chars = {'A', 'U', 'G', 'C'}
@@ -56,7 +56,6 @@ def probabilistic_bt_to_chosen_bt(bt, i, j):
    length = j - i + 1
    while (bt[i][j] != []):

-        # print(i, j, bt[i][j]) # debugging
        paths = [path for path, _ in bt[i][j]]
        scores = [score for _, score in bt[i][j]]
        weights1 = scores_to_weights(scores)
@@ -71,7 +70,6 @@ def probabilistic_bt_to_chosen_bt(bt, i, j):
            i += 1
            j -= 1
        else: # bifurcation
-            # print(bt[i][j]) # debugging
            probabilistic_bt_to_chosen_bt(bt, i, bt[i][j])
            probabilistic_bt_to_chosen_bt(bt, bt[i][j] + 1, j)
            return bt
@@ -80,7 +78,6 @@ def probabilistic_bt_to_chosen_bt(bt, i, j):

 # returns a list consisting of '(', ')', and '-' characters
 def bt_to_coded_list(bt, i, j):
-    # print(i, j) # debugging
    length = j - i + 1
    og_i = i
    coded_list = list('-' * length)
@@ -251,15 +248,13 @@ dp_table, bt = nussinov(rna_sequence, optimized, probabilistic)

 print("\n\nthe dynamic programming table:\n")
 print_2d_array(dp_table)
-# print(bt) # debugging

 if (probabilistic):
    bt = probabilistic_bt_to_chosen_bt(bt, 0, len(rna_sequence) - 1)
-    # print(bt) # debugging
    print("\n\nthe probabilistically chosen secondary structure, in hyphen-parentheses notation, and as a list of base-pairs", \
          "each key-value pair is a base-pair)\n")
 else:
-    print("\n\nthe optimal secondary structure, in hyphen-parentheses notation, and as a list of base-pairs (each key-value pair is a base-pair)\n")
+    print("\n\nthe optimal secondary structure, in hyphen-parentheses notation, and as a list of base-pairs (each key-value pair is a base-pair):\n")

 coded_list = bt_to_coded_list(bt, 0, len(rna_sequence) - 1)
 base_pairs = bt_to_base_pairs(bt, 0, len(rna_sequence) - 1, {})
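To see how the `(score + 1) ** 3` weighting from `scores_to_weights` biases the probabilistic backtrace, here is a small standalone sketch. The `scores_to_weights` formula is copied from the diff; `choose_path` and the candidate labels are hypothetical, and using `random.choices` for the draw is an assumption, since the sampling call itself is not visible in these hunks.

```python
import random


def scores_to_weights(scores):
    # Same formula as in the diff: shift by 1, then cube.
    return [(score + 1) ** 3 for score in scores]


def choose_path(candidates, rng=random):
    """Pick one path from a list of (path, score) tuples.

    Hypothetical helper: higher-scoring paths are more likely, but every
    path keeps a nonzero weight because of the + 1 shift.
    """
    paths = [path for path, _ in candidates]
    scores = [score for _, score in candidates]
    return rng.choices(paths, weights=scores_to_weights(scores), k=1)[0]


print(scores_to_weights([0, 0.5, 1]))  # [1, 3.375, 8]

# Illustrative candidates only; the labels are made up for this example.
rng = random.Random(0)
candidates = [("skip_left", 0), ("pair_GU", 0.5), ("pair_AU", 1)]
print(choose_path(candidates, rng))
```

With these weights a score of 1 is eight times as likely to be picked as a score of 0, and a list of all-zero scores still yields equal, nonzero weights.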