Jump to content

Edit filter log

Details for log entry 38,033,438

23:28, 17 June 2024: Lindsey40186 (talk | contribs) triggered filter 550, performing the action "edit" on Fast statistical alignment. Actions taken: Tag; Filter description: nowiki tags inserted into an article (examine | diff)

Changes made in edit

{{Infobox Software|name=FSA|developer=Robert Bradley ([[UC Berkeley]]), Colin Dewey ([[UW Madison]]), [[Lior Pachter]] ([[UC Berkeley]])|latest_release_version=1.5.2|operating_system=[[UNIX]], [[Linux]], [[Apple Macintosh|Mac]]|genre=Bioinformatics tool|licence=Open source}}
{{No inline|date={{subst:CURRENTMONTHNAME}} {{subst:CURRENTYEAR}}}}{{Infobox Software|name=FSA|developer=Robert Bradley ([[UC Berkeley]]), Colin Dewey ([[UW Madison]]), [[Lior Pachter]] ([[UC Berkeley]])|latest_release_version=1.5.2|operating_system=[[UNIX]], [[Linux]], [[Apple Macintosh|Mac]]|genre=Bioinformatics tool|licence=Open source}}


'''Fast statistical alignment''' or '''FSA''' is a [[multiple sequence alignment]] program for aligning many proteins or RNAs or long genomic DNA sequences. Along with [[MUSCLE (alignment software)|MUSCLE]] and [[MAFFT]], FSA is one of the few sequence alignment programs which can align datasets of hundreds or thousands of sequences. FSA uses a different optimization criterion which allows it to more reliably identify non-homologous sequences than these other programs, although this increased accuracy comes at the cost of decreased speed.
'''Fast statistical alignment''' or '''FSA''' is a [[multiple sequence alignment]] program for aligning many proteins, [[RNA]]<nowiki/>s, or long genomic [[Nucleic acid sequence|DNA sequences]]. Along with [[MUSCLE (alignment software)|MUSCLE]] and [[MAFFT]], FSA is one of the few sequence alignment programs which can align datasets of hundreds or thousands of sequences. FSA uses a different optimization criterion which allows it to more reliably identify non-homologous sequences than these other programs, although this increased accuracy comes at the cost of decreased speed.


FSA is currently being used for projects including sequencing new worm genomes and analyzing ''in vivo'' transcription factor binding in flies.
FSA is currently being used for multiple projects, including sequencing new worm genomes and analyzing ''in vivo'' transcription factor binding in flies.


== Input/Output ==
== Input/Output ==


=== Pair Hidden Markov Model for generating posterior probabilities===
=== Pair Hidden Markov Model for generating posterior probabilities===
The algorithm starts first by determining [[posterior probabilities]] of alignment <math>\mathbb{P}(A|X, Y)</math> between any two random sequences from the pool of sequences being aligned. The posterior probabilities for each column reinforce the prediction of alignment probability between a sequence pair and also filter out columns that can be unreliably aligned. These probabilities also allow for the prediction and estimate of homology between any sequence pair. A standard five-state pair [[hidden Markov model]] (Pair HMM) is used to determine these posterior probabilities of alignment for any two input sequences. The Pair HMM model uses two sets of Delete (D) and Insert (I) states to account for symbol deletion and insertions between two aligned sequences, but it can also have three states without a significant loss of accuracy.
The algorithm starts first by determining [[posterior probabilities]] of alignment <math>\mathbb{P}(A|X, Y)</math> between any two random sequences from the pool of sequences being aligned. The posterior probabilities for each column reinforce the prediction of alignment probability between a sequence pair and also filter out columns that can be unreliably aligned. These probabilities also allow for the prediction and estimate of homology between any sequence pair. A standard five-state pair [[hidden Markov model]] (Pair HMM) is used to determine these posterior probabilities of alignment for any two input sequences. The Pair HMM model uses two sets of Delete (D) and Insert (I) states to account for symbol deletion and insertions between two aligned sequences, but it can also have three states without a significant loss of accuracy.


Since the number of pairwise comparisons needed to determine the posterior probability distributions of any two pairs of sequences is computationally expensive and quadratic in the amount of sequences that are being aligned, it is decreased by using a randomized approach inspired by the Erdos-Renyi theory of random graphs. This significantly reduces the runtimes of datasets and the computational cost of running the multiple alignments.
Since the number of pair-wise comparisons needed to determine the posterior probability distributions of any two pairs of sequences is computationally expensive and quadratic in the amount of sequences that are being aligned, it is decreased by using a randomized approach inspired by the Erdos-Renyi theory of random graphs. This significantly reduces the runtimes of datasets and the computational cost of running the multiple alignments.


=== Merging Probabilities ===
=== Merging Probabilities ===


=== Sequence Annealing ===
=== Sequence Annealing ===
Most existing programs that run multiple sequence alignment algorithms are based on progressive alignment where the process starts with a “null alignment, a state where none of the sequences have been aligned. The pool of sequences is then aligned either through pairwise comparisons or through an alignment of a pair of partial alignments of subsequences. This process can cause issues in alignment because the resulting multiple sequence alignment can and will heavily depend on the sequences that are aligned at the start. There is no realigning of previously aligned sequences that could correct the MSA.
Most existing programs that run multiple sequence alignment algorithms are based on progressive alignment where the process starts with a "null alignment," a state where none of the sequences have been aligned. The pool of sequences is then aligned either through pairwise comparisons or through an alignment of a pair of partial alignments of subsequences. This process can cause issues in alignment because the resulting multiple sequence alignment can and will heavily depend on the sequences that are aligned at the start. There is no realigning of previously aligned sequences that could correct the MSA.


FSA uses the sequence annealing technique to overcome this issue. The sorted posterior probabilities are used with the sequence annealing technique to generate a multiple alignment. The technique finds the alignment between two sequences that minimizes the expected distance to the truth. In this case, the distance between two sequences is the number of columns in which the character from one sequence is not homologous to the character in the same column in the second sequence.
FSA uses the sequence annealing technique to overcome this issue. The sorted posterior probabilities are used with the sequence annealing technique to generate a multiple alignment. The technique finds the alignment between two sequences that minimizes the expected distance to the truth. In this case, the distance between two sequences is the number of columns in which the character from one sequence is not homologous to the character in the same column in the second sequence.


The sequence annealing technique, by determining an alignment with the minimum expected distance to the truth, conversely finds the alignment with the maximum expected accuracy. The accuracy of an alignment depends on a “true” alignment as reference and indicates the fraction of columns where the sequences are homologous. This accuracy is then used as an objective function that starts with the unaligned sequences (null alignment) and aligns characters in different columns based on the increasing accuracy of an alignment.
The sequence annealing technique, by determining an alignment with the minimum expected distance to the truth, conversely finds the alignment with the maximum expected accuracy. The accuracy of an alignment depends on a "true" alignment as reference and indicates the fraction of columns where the sequences are homologous. This accuracy is then used as an objective function that starts with the unaligned sequences (null alignment) and aligns characters in different columns based on the increasing accuracy of an alignment.


=== Ordering of the alignment ===
=== Ordering of the alignment ===
FSA aligns multiple sequences based on homology within columns instead of strictly a consideration of indels and substitutions. As such, FSA considers alignments to be equivalent if for every position along the sequences in both alignments, the same statement about homology can be made. For example, when considering pairwise comparisons, if there is a gap at a specific position in two alignments, then it can be said that the two sequences being compared are not homologous at said position. This can result in alignments where gap-open events can differ and yet still be considered equivalent. As such, FSA chooses to output the alignment in which there is a minimum amount of “gap openings.
FSA aligns multiple sequences based on homology within columns instead of strictly a consideration of indels and substitutions. As such, FSA considers alignments to be equivalent if for every position along the sequences in both alignments, the same statement about homology can be made. For example, when considering pairwise comparisons, if there is a gap at a specific position in two alignments, then it can be said that the two sequences being compared are not homologous at said position. This can result in alignments where gap-open events can differ and yet still be considered equivalent. As such, FSA chooses to output the alignment in which there is a minimum amount of "gap openings".


== Parallelization ==
== Parallelization ==
To handle overly large datasets, FSA is able to divide the work of running all necessary pairwise comparisons and alignments to different processors. This is handled by using a “fixed-size chunking” strategy that distributes the pairwise comparisons to each available processor in chunks. Each processor is therefore able to run the posterior probability calculation on a chunk of pairwise comparisons before merging the collected data back to a single processor for sequence annealing.
To handle overly large datasets, FSA is able to divide the work of running all necessary pairwise comparisons and alignments to different processors. This is handled by using a "fixed-size chunking" strategy that distributes the pairwise comparisons to each available processor in chunks. Each processor is therefore able to run the posterior probability calculation on a chunk of pairwise comparisons before merging the collected data back to a single processor for sequence annealing.


== Visualization ==
== Visualization ==
The results of the multiple sequence alignment under FSA can be displayed under the FSA's own GUI. The GUI is able to display and color label different measures of alignment quality on the columns of characters within the alignment itself. The five different measures that can be observed and are approximated under the FSA model include accuracy, sensitivity, certainty, specificity, and consistency.
The results of the multiple sequence alignment under FSA can be displayed under the FSA's own GUI. The GUI is able to display and color label different measures of alignment quality on the columns of characters within the alignment itself. The five different measures that can be observed and are approximated under the FSA model include accuracy, sensitivity, certainty, specificity, and consistency.


== Comparisons to other alignment programs ==
== Comparisons to other programs ==
FSA has been benchmarked against multiple alignment databases for protein (SABmark 1.65 and BAliBASE 3), RNA (BRAliBase 2.1 and Consanmix80), and DNA sequences. These benchmarks were conducted alongside other popular alignment programs such as ClustalW, MAFFT, MUSCLE, T-Coffee, and so on. Overall, at the time that FSA's abstract and research paper was received for review, FSA outperformed most alignment programs in accuracy and positive predictive values with sensitivities being on-par with the better-performing programs such as MAFFT and ProbConsRNA. Runtime comparisons were also conducted by comparing the timings to align 16S ribosomal sequences. MAFFT performed the alignment faster than the other alignment programs while MUSCLE and FSA (using a 3-state HMM and with disabled iterative refinement) were the next fastest programs.
FSA has been benchmarked against multiple alignment databases for protein (SABmark 1.65 and BAliBASE 3), RNA (BRAliBase 2.1 and Consanmix80), and DNA sequences. These benchmarks were conducted alongside other popular alignment programs such as ClustalW, MAFFT, MUSCLE, T-Coffee, and so on. Overall, at the time that FSA's abstract and research paper was received for review, FSA outperformed most alignment programs in accuracy and positive predictive values with sensitivities being on-par with the better-performing programs such as MAFFT and ProbConsRNA. Runtime comparisons were also conducted by comparing the timings to align 16S ribosomal sequences. MAFFT performed the alignment faster than the other alignment programs while MUSCLE and FSA (using a 3-state HMM and with disabled iterative refinement) were the next fastest programs.


Action parameters

VariableValue
Edit count of the user (user_editcount)
7401
Name of the user account (user_name)
'Lindsey40186'
Age of the user account (user_age)
472189305
Groups (including implicit) the user is in (user_groups)
[ 0 => 'extendedconfirmed', 1 => '*', 2 => 'user', 3 => 'autoconfirmed' ]
Rights that the user has (user_rights)
[ 0 => 'extendedconfirmed', 1 => 'createaccount', 2 => 'read', 3 => 'edit', 4 => 'createtalk', 5 => 'writeapi', 6 => 'viewmyprivateinfo', 7 => 'editmyprivateinfo', 8 => 'editmyoptions', 9 => 'abusefilter-log-detail', 10 => 'urlshortener-create-url', 11 => 'centralauth-merge', 12 => 'abusefilter-view', 13 => 'abusefilter-log', 14 => 'vipsscaler-test', 15 => 'collectionsaveasuserpage', 16 => 'reupload-own', 17 => 'move-rootuserpages', 18 => 'createpage', 19 => 'minoredit', 20 => 'editmyusercss', 21 => 'editmyuserjson', 22 => 'editmyuserjs', 23 => 'sendemail', 24 => 'applychangetags', 25 => 'viewmywatchlist', 26 => 'editmywatchlist', 27 => 'spamblacklistlog', 28 => 'mwoauthmanagemygrants', 29 => 'reupload', 30 => 'upload', 31 => 'move', 32 => 'autoconfirmed', 33 => 'editsemiprotected', 34 => 'skipcaptcha', 35 => 'ipinfo', 36 => 'ipinfo-view-basic', 37 => 'transcode-reset', 38 => 'transcode-status', 39 => 'createpagemainns', 40 => 'movestable', 41 => 'autoreview', 42 => 'enrollasmentor' ]
Whether or not a user is editing through the mobile interface (user_mobile)
false
Whether the user is editing from mobile app (user_app)
false
Page ID (page_id)
23776072
Page namespace (page_namespace)
0
Page title without namespace (page_title)
'Fast statistical alignment'
Full page title (page_prefixedtitle)
'Fast statistical alignment'
Edit protection level of the page (page_restrictions_edit)
[]
Page age in seconds (page_age)
469869476
Action (action)
'edit'
Edit summary/reason (summary)
'tagged: no inline citations (and only 4 sources?); ce'
Time since last page edit in seconds (page_last_edit_age)
488
Old content model (old_content_model)
'wikitext'
New content model (new_content_model)
'wikitext'
Old page wikitext, before the edit (old_wikitext)
'{{Infobox Software|name=FSA|developer=Robert Bradley ([[UC Berkeley]]), Colin Dewey ([[UW Madison]]), [[Lior Pachter]] ([[UC Berkeley]])|latest_release_version=1.5.2|operating_system=[[UNIX]], [[Linux]], [[Apple Macintosh|Mac]]|genre=Bioinformatics tool|licence=Open source}} '''Fast statistical alignment''' or '''FSA''' is a [[multiple sequence alignment]] program for aligning many proteins or RNAs or long genomic DNA sequences. Along with [[MUSCLE (alignment software)|MUSCLE]] and [[MAFFT]], FSA is one of the few sequence alignment programs which can align datasets of hundreds or thousands of sequences. FSA uses a different optimization criterion which allows it to more reliably identify non-homologous sequences than these other programs, although this increased accuracy comes at the cost of decreased speed. FSA is currently being used for projects including sequencing new worm genomes and analyzing ''in vivo'' transcription factor binding in flies. == Input/Output == This program accepts sequences in [[FASTA format]] and outputs alignments in [[FASTA format]] or [[Stockholm format]]. == Algorithm == The algorithm for the aligning of the input sequences has 4 core components. === Pair Hidden Markov Model for generating posterior probabilities=== The algorithm starts first by determining [[posterior probabilities]] of alignment <math>\mathbb{P}(A|X, Y)</math> between any two random sequences from the pool of sequences being aligned. The posterior probabilities for each column reinforce the prediction of alignment probability between a sequence pair and also filter out columns that can be unreliably aligned. These probabilities also allow for the prediction and estimate of homology between any sequence pair. A standard five-state pair [[hidden Markov model]] (Pair HMM) is used to determine these posterior probabilities of alignment for any two input sequences. The Pair HMM model uses two sets of Delete (D) and Insert (I) states to account for symbol deletion and insertions between two aligned sequences, but it can also have three states without a significant loss of accuracy. Since the number of pairwise comparisons needed to determine the posterior probability distributions of any two pairs of sequences is computationally expensive and quadratic in the amount of sequences that are being aligned, it is decreased by using a randomized approach inspired by the Erdos-Renyi theory of random graphs. This significantly reduces the runtimes of datasets and the computational cost of running the multiple alignments. === Merging Probabilities === The posterior probabilities for each column in the sequence pairs are sorted using a weighting function that uses a steepest-ascent algorithm. === Sequence Annealing === Most existing programs that run multiple sequence alignment algorithms are based on progressive alignment where the process starts with a “null alignment,” a state where none of the sequences have been aligned. The pool of sequences is then aligned either through pairwise comparisons or through an alignment of a pair of partial alignments of subsequences. This process can cause issues in alignment because the resulting multiple sequence alignment can and will heavily depend on the sequences that are aligned at the start. There is no realigning of previously aligned sequences that could correct the MSA. FSA uses the sequence annealing technique to overcome this issue. The sorted posterior probabilities are used with the sequence annealing technique to generate a multiple alignment. The technique finds the alignment between two sequences that minimizes the expected distance to the truth. In this case, the distance between two sequences is the number of columns in which the character from one sequence is not homologous to the character in the same column in the second sequence. The sequence annealing technique, by determining an alignment with the minimum expected distance to the truth, conversely finds the alignment with the maximum expected accuracy. The accuracy of an alignment depends on a “true” alignment as reference and indicates the fraction of columns where the sequences are homologous. This accuracy is then used as an objective function that starts with the unaligned sequences (null alignment) and aligns characters in different columns based on the increasing accuracy of an alignment. === Ordering of the alignment === FSA aligns multiple sequences based on homology within columns instead of strictly a consideration of indels and substitutions. As such, FSA considers alignments to be equivalent if for every position along the sequences in both alignments, the same statement about homology can be made. For example, when considering pairwise comparisons, if there is a gap at a specific position in two alignments, then it can be said that the two sequences being compared are not homologous at said position. This can result in alignments where gap-open events can differ and yet still be considered equivalent. As such, FSA chooses to output the alignment in which there is a minimum amount of “gap openings.” == Parallelization == To handle overly large datasets, FSA is able to divide the work of running all necessary pairwise comparisons and alignments to different processors. This is handled by using a “fixed-size chunking” strategy that distributes the pairwise comparisons to each available processor in chunks. Each processor is therefore able to run the posterior probability calculation on a chunk of pairwise comparisons before merging the collected data back to a single processor for sequence annealing. == Visualization == The results of the multiple sequence alignment under FSA can be displayed under the FSA's own GUI. The GUI is able to display and color label different measures of alignment quality on the columns of characters within the alignment itself. The five different measures that can be observed and are approximated under the FSA model include accuracy, sensitivity, certainty, specificity, and consistency. == Comparisons to other alignment programs == FSA has been benchmarked against multiple alignment databases for protein (SABmark 1.65 and BAliBASE 3), RNA (BRAliBase 2.1 and Consanmix80), and DNA sequences. These benchmarks were conducted alongside other popular alignment programs such as ClustalW, MAFFT, MUSCLE, T-Coffee, and so on. Overall, at the time that FSA's abstract and research paper was received for review, FSA outperformed most alignment programs in accuracy and positive predictive values with sensitivities being on-par with the better-performing programs such as MAFFT and ProbConsRNA. Runtime comparisons were also conducted by comparing the timings to align 16S ribosomal sequences. MAFFT performed the alignment faster than the other alignment programs while MUSCLE and FSA (using a 3-state HMM and with disabled iterative refinement) were the next fastest programs. == References == {{cite journal|vauthors=Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L|author8-link=Lior Pachter|date=2009|title=Fast Statistical Alignment|journal=PLOS Computational Biology |volume=5|issue=5|pages=e1000392|doi=10.1371/journal.pcbi.1000392|pmid=19478997|pmc=2684580|bibcode=2009PLSCB...5E0392B |doi-access=free }} Schwartz AS, Pachter L (2007) Multiple alignment by sequence annealing. Bioinformatics 23: e24-9. Eddy SR. Multiple alignment using hidden Markov models. Proc Int Conf Intell Syst Mol Biol. 1995;3:114-20. PMID 7584426. == External links == * [https://web.archive.org/web/20090430184649/http://orangutan.math.berkeley.edu/fsa/ FSA web server] * [http://fsa.sourceforge.net/ FSA source code] [[Category:Bioinformatics]]'
New page wikitext, after the edit (new_wikitext)
'{{No inline|date={{subst:CURRENTMONTHNAME}} {{subst:CURRENTYEAR}}}}{{Infobox Software|name=FSA|developer=Robert Bradley ([[UC Berkeley]]), Colin Dewey ([[UW Madison]]), [[Lior Pachter]] ([[UC Berkeley]])|latest_release_version=1.5.2|operating_system=[[UNIX]], [[Linux]], [[Apple Macintosh|Mac]]|genre=Bioinformatics tool|licence=Open source}} '''Fast statistical alignment''' or '''FSA''' is a [[multiple sequence alignment]] program for aligning many proteins, [[RNA]]<nowiki/>s, or long genomic [[Nucleic acid sequence|DNA sequences]]. Along with [[MUSCLE (alignment software)|MUSCLE]] and [[MAFFT]], FSA is one of the few sequence alignment programs which can align datasets of hundreds or thousands of sequences. FSA uses a different optimization criterion which allows it to more reliably identify non-homologous sequences than these other programs, although this increased accuracy comes at the cost of decreased speed. FSA is currently being used for multiple projects, including sequencing new worm genomes and analyzing ''in vivo'' transcription factor binding in flies. == Input/Output == This program accepts sequences in [[FASTA format]] and outputs alignments in [[FASTA format]] or [[Stockholm format]]. == Algorithm == The algorithm for the aligning of the input sequences has 4 core components. === Pair Hidden Markov Model for generating posterior probabilities=== The algorithm starts first by determining [[posterior probabilities]] of alignment <math>\mathbb{P}(A|X, Y)</math> between any two random sequences from the pool of sequences being aligned. The posterior probabilities for each column reinforce the prediction of alignment probability between a sequence pair and also filter out columns that can be unreliably aligned. These probabilities also allow for the prediction and estimate of homology between any sequence pair. A standard five-state pair [[hidden Markov model]] (Pair HMM) is used to determine these posterior probabilities of alignment for any two input sequences. The Pair HMM model uses two sets of Delete (D) and Insert (I) states to account for symbol deletion and insertions between two aligned sequences, but it can also have three states without a significant loss of accuracy. Since the number of pair-wise comparisons needed to determine the posterior probability distributions of any two pairs of sequences is computationally expensive and quadratic in the amount of sequences that are being aligned, it is decreased by using a randomized approach inspired by the Erdos-Renyi theory of random graphs. This significantly reduces the runtimes of datasets and the computational cost of running the multiple alignments. === Merging Probabilities === The posterior probabilities for each column in the sequence pairs are sorted using a weighting function that uses a steepest-ascent algorithm. === Sequence Annealing === Most existing programs that run multiple sequence alignment algorithms are based on progressive alignment where the process starts with a "null alignment," a state where none of the sequences have been aligned. The pool of sequences is then aligned either through pairwise comparisons or through an alignment of a pair of partial alignments of subsequences. This process can cause issues in alignment because the resulting multiple sequence alignment can and will heavily depend on the sequences that are aligned at the start. There is no realigning of previously aligned sequences that could correct the MSA. FSA uses the sequence annealing technique to overcome this issue. The sorted posterior probabilities are used with the sequence annealing technique to generate a multiple alignment. The technique finds the alignment between two sequences that minimizes the expected distance to the truth. In this case, the distance between two sequences is the number of columns in which the character from one sequence is not homologous to the character in the same column in the second sequence. The sequence annealing technique, by determining an alignment with the minimum expected distance to the truth, conversely finds the alignment with the maximum expected accuracy. The accuracy of an alignment depends on a "true" alignment as reference and indicates the fraction of columns where the sequences are homologous. This accuracy is then used as an objective function that starts with the unaligned sequences (null alignment) and aligns characters in different columns based on the increasing accuracy of an alignment. === Ordering of the alignment === FSA aligns multiple sequences based on homology within columns instead of strictly a consideration of indels and substitutions. As such, FSA considers alignments to be equivalent if for every position along the sequences in both alignments, the same statement about homology can be made. For example, when considering pairwise comparisons, if there is a gap at a specific position in two alignments, then it can be said that the two sequences being compared are not homologous at said position. This can result in alignments where gap-open events can differ and yet still be considered equivalent. As such, FSA chooses to output the alignment in which there is a minimum amount of "gap openings". == Parallelization == To handle overly large datasets, FSA is able to divide the work of running all necessary pairwise comparisons and alignments to different processors. This is handled by using a "fixed-size chunking" strategy that distributes the pairwise comparisons to each available processor in chunks. Each processor is therefore able to run the posterior probability calculation on a chunk of pairwise comparisons before merging the collected data back to a single processor for sequence annealing. == Visualization == The results of the multiple sequence alignment under FSA can be displayed under the FSA's own GUI. The GUI is able to display and color label different measures of alignment quality on the columns of characters within the alignment itself. The five different measures that can be observed and are approximated under the FSA model include accuracy, sensitivity, certainty, specificity, and consistency. == Comparisons to other programs == FSA has been benchmarked against multiple alignment databases for protein (SABmark 1.65 and BAliBASE 3), RNA (BRAliBase 2.1 and Consanmix80), and DNA sequences. These benchmarks were conducted alongside other popular alignment programs such as ClustalW, MAFFT, MUSCLE, T-Coffee, and so on. Overall, at the time that FSA's abstract and research paper was received for review, FSA outperformed most alignment programs in accuracy and positive predictive values with sensitivities being on-par with the better-performing programs such as MAFFT and ProbConsRNA. Runtime comparisons were also conducted by comparing the timings to align 16S ribosomal sequences. MAFFT performed the alignment faster than the other alignment programs while MUSCLE and FSA (using a 3-state HMM and with disabled iterative refinement) were the next fastest programs. == References == {{cite journal|vauthors=Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L|author8-link=Lior Pachter|date=2009|title=Fast Statistical Alignment|journal=PLOS Computational Biology |volume=5|issue=5|pages=e1000392|doi=10.1371/journal.pcbi.1000392|pmid=19478997|pmc=2684580|bibcode=2009PLSCB...5E0392B |doi-access=free }} Schwartz AS, Pachter L (2007) Multiple alignment by sequence annealing. Bioinformatics 23: e24-9. Eddy SR. Multiple alignment using hidden Markov models. Proc Int Conf Intell Syst Mol Biol. 1995;3:114-20. PMID 7584426. == External links == * [https://web.archive.org/web/20090430184649/http://orangutan.math.berkeley.edu/fsa/ FSA web server] * [http://fsa.sourceforge.net/ FSA source code] [[Category:Bioinformatics]]'
Unified diff of changes made by edit (edit_diff)
'@@ -1,7 +1,7 @@ -{{Infobox Software|name=FSA|developer=Robert Bradley ([[UC Berkeley]]), Colin Dewey ([[UW Madison]]), [[Lior Pachter]] ([[UC Berkeley]])|latest_release_version=1.5.2|operating_system=[[UNIX]], [[Linux]], [[Apple Macintosh|Mac]]|genre=Bioinformatics tool|licence=Open source}} +{{No inline|date={{subst:CURRENTMONTHNAME}} {{subst:CURRENTYEAR}}}}{{Infobox Software|name=FSA|developer=Robert Bradley ([[UC Berkeley]]), Colin Dewey ([[UW Madison]]), [[Lior Pachter]] ([[UC Berkeley]])|latest_release_version=1.5.2|operating_system=[[UNIX]], [[Linux]], [[Apple Macintosh|Mac]]|genre=Bioinformatics tool|licence=Open source}} -'''Fast statistical alignment''' or '''FSA''' is a [[multiple sequence alignment]] program for aligning many proteins or RNAs or long genomic DNA sequences. Along with [[MUSCLE (alignment software)|MUSCLE]] and [[MAFFT]], FSA is one of the few sequence alignment programs which can align datasets of hundreds or thousands of sequences. FSA uses a different optimization criterion which allows it to more reliably identify non-homologous sequences than these other programs, although this increased accuracy comes at the cost of decreased speed. +'''Fast statistical alignment''' or '''FSA''' is a [[multiple sequence alignment]] program for aligning many proteins, [[RNA]]<nowiki/>s, or long genomic [[Nucleic acid sequence|DNA sequences]]. Along with [[MUSCLE (alignment software)|MUSCLE]] and [[MAFFT]], FSA is one of the few sequence alignment programs which can align datasets of hundreds or thousands of sequences. FSA uses a different optimization criterion which allows it to more reliably identify non-homologous sequences than these other programs, although this increased accuracy comes at the cost of decreased speed. -FSA is currently being used for projects including sequencing new worm genomes and analyzing ''in vivo'' transcription factor binding in flies. +FSA is currently being used for multiple projects, including sequencing new worm genomes and analyzing ''in vivo'' transcription factor binding in flies. == Input/Output == @@ -12,7 +12,7 @@ === Pair Hidden Markov Model for generating posterior probabilities=== -The algorithm starts first by determining [[posterior probabilities]] of alignment <math>\mathbb{P}(A|X, Y)</math> between any two random sequences from the pool of sequences being aligned. The posterior probabilities for each column reinforce the prediction of alignment probability between a sequence pair and also filter out columns that can be unreliably aligned. These probabilities also allow for the prediction and estimate of homology between any sequence pair. A standard five-state pair [[hidden Markov model]] (Pair HMM) is used to determine these posterior probabilities of alignment for any two input sequences. The Pair HMM model uses two sets of Delete (D) and Insert (I) states to account for symbol deletion and insertions between two aligned sequences, but it can also have three states without a significant loss of accuracy. +The algorithm starts first by determining [[posterior probabilities]] of alignment <math>\mathbb{P}(A|X, Y)</math> between any two random sequences from the pool of sequences being aligned. The posterior probabilities for each column reinforce the prediction of alignment probability between a sequence pair and also filter out columns that can be unreliably aligned. These probabilities also allow for the prediction and estimate of homology between any sequence pair. A standard five-state pair [[hidden Markov model]] (Pair HMM) is used to determine these posterior probabilities of alignment for any two input sequences. The Pair HMM model uses two sets of Delete (D) and Insert (I) states to account for symbol deletion and insertions between two aligned sequences, but it can also have three states without a significant loss of accuracy. -Since the number of pairwise comparisons needed to determine the posterior probability distributions of any two pairs of sequences is computationally expensive and quadratic in the amount of sequences that are being aligned, it is decreased by using a randomized approach inspired by the Erdos-Renyi theory of random graphs. This significantly reduces the runtimes of datasets and the computational cost of running the multiple alignments. +Since the number of pair-wise comparisons needed to determine the posterior probability distributions of any two pairs of sequences is computationally expensive and quadratic in the amount of sequences that are being aligned, it is decreased by using a randomized approach inspired by the Erdos-Renyi theory of random graphs. This significantly reduces the runtimes of datasets and the computational cost of running the multiple alignments. === Merging Probabilities === @@ -20,20 +20,20 @@ === Sequence Annealing === -Most existing programs that run multiple sequence alignment algorithms are based on progressive alignment where the process starts with a “null alignment,” a state where none of the sequences have been aligned. The pool of sequences is then aligned either through pairwise comparisons or through an alignment of a pair of partial alignments of subsequences. This process can cause issues in alignment because the resulting multiple sequence alignment can and will heavily depend on the sequences that are aligned at the start. There is no realigning of previously aligned sequences that could correct the MSA. +Most existing programs that run multiple sequence alignment algorithms are based on progressive alignment where the process starts with a "null alignment," a state where none of the sequences have been aligned. The pool of sequences is then aligned either through pairwise comparisons or through an alignment of a pair of partial alignments of subsequences. This process can cause issues in alignment because the resulting multiple sequence alignment can and will heavily depend on the sequences that are aligned at the start. There is no realigning of previously aligned sequences that could correct the MSA. -FSA uses the sequence annealing technique to overcome this issue. The sorted posterior probabilities are used with the sequence annealing technique to generate a multiple alignment. The technique finds the alignment between two sequences that minimizes the expected distance to the truth. In this case, the distance between two sequences is the number of columns in which the character from one sequence is not homologous to the character in the same column in the second sequence. +FSA uses the sequence annealing technique to overcome this issue. The sorted posterior probabilities are used with the sequence annealing technique to generate a multiple alignment. The technique finds the alignment between two sequences that minimizes the expected distance to the truth. In this case, the distance between two sequences is the number of columns in which the character from one sequence is not homologous to the character in the same column in the second sequence. -The sequence annealing technique, by determining an alignment with the minimum expected distance to the truth, conversely finds the alignment with the maximum expected accuracy. The accuracy of an alignment depends on a “true” alignment as reference and indicates the fraction of columns where the sequences are homologous. This accuracy is then used as an objective function that starts with the unaligned sequences (null alignment) and aligns characters in different columns based on the increasing accuracy of an alignment. +The sequence annealing technique, by determining an alignment with the minimum expected distance to the truth, conversely finds the alignment with the maximum expected accuracy. The accuracy of an alignment depends on a "true" alignment as reference and indicates the fraction of columns where the sequences are homologous. This accuracy is then used as an objective function that starts with the unaligned sequences (null alignment) and aligns characters in different columns based on the increasing accuracy of an alignment. === Ordering of the alignment === -FSA aligns multiple sequences based on homology within columns instead of strictly a consideration of indels and substitutions. As such, FSA considers alignments to be equivalent if for every position along the sequences in both alignments, the same statement about homology can be made. For example, when considering pairwise comparisons, if there is a gap at a specific position in two alignments, then it can be said that the two sequences being compared are not homologous at said position. This can result in alignments where gap-open events can differ and yet still be considered equivalent. As such, FSA chooses to output the alignment in which there is a minimum amount of “gap openings.” +FSA aligns multiple sequences based on homology within columns instead of strictly a consideration of indels and substitutions. As such, FSA considers alignments to be equivalent if for every position along the sequences in both alignments, the same statement about homology can be made. For example, when considering pairwise comparisons, if there is a gap at a specific position in two alignments, then it can be said that the two sequences being compared are not homologous at said position. This can result in alignments where gap-open events can differ and yet still be considered equivalent. As such, FSA chooses to output the alignment in which there is a minimum amount of "gap openings". == Parallelization == -To handle overly large datasets, FSA is able to divide the work of running all necessary pairwise comparisons and alignments to different processors. This is handled by using a “fixed-size chunking” strategy that distributes the pairwise comparisons to each available processor in chunks. Each processor is therefore able to run the posterior probability calculation on a chunk of pairwise comparisons before merging the collected data back to a single processor for sequence annealing. +To handle overly large datasets, FSA is able to divide the work of running all necessary pairwise comparisons and alignments to different processors. This is handled by using a "fixed-size chunking" strategy that distributes the pairwise comparisons to each available processor in chunks. Each processor is therefore able to run the posterior probability calculation on a chunk of pairwise comparisons before merging the collected data back to a single processor for sequence annealing. == Visualization == The results of the multiple sequence alignment under FSA can be displayed under the FSA's own GUI. The GUI is able to display and color label different measures of alignment quality on the columns of characters within the alignment itself. The five different measures that can be observed and are approximated under the FSA model include accuracy, sensitivity, certainty, specificity, and consistency. -== Comparisons to other alignment programs == +== Comparisons to other programs == FSA has been benchmarked against multiple alignment databases for protein (SABmark 1.65 and BAliBASE 3), RNA (BRAliBase 2.1 and Consanmix80), and DNA sequences. These benchmarks were conducted alongside other popular alignment programs such as ClustalW, MAFFT, MUSCLE, T-Coffee, and so on. Overall, at the time that FSA's abstract and research paper was received for review, FSA outperformed most alignment programs in accuracy and positive predictive values with sensitivities being on-par with the better-performing programs such as MAFFT and ProbConsRNA. Runtime comparisons were also conducted by comparing the timings to align 16S ribosomal sequences. MAFFT performed the alignment faster than the other alignment programs while MUSCLE and FSA (using a 3-state HMM and with disabled iterative refinement) were the next fastest programs. '
New page size (new_size)
7830
Old page size (old_size)
7744
Size change in edit (edit_delta)
86
Lines added in edit (added_lines)
[ 0 => '{{No inline|date={{subst:CURRENTMONTHNAME}} {{subst:CURRENTYEAR}}}}{{Infobox Software|name=FSA|developer=Robert Bradley ([[UC Berkeley]]), Colin Dewey ([[UW Madison]]), [[Lior Pachter]] ([[UC Berkeley]])|latest_release_version=1.5.2|operating_system=[[UNIX]], [[Linux]], [[Apple Macintosh|Mac]]|genre=Bioinformatics tool|licence=Open source}}', 1 => ''''Fast statistical alignment''' or '''FSA''' is a [[multiple sequence alignment]] program for aligning many proteins, [[RNA]]<nowiki/>s, or long genomic [[Nucleic acid sequence|DNA sequences]]. Along with [[MUSCLE (alignment software)|MUSCLE]] and [[MAFFT]], FSA is one of the few sequence alignment programs which can align datasets of hundreds or thousands of sequences. FSA uses a different optimization criterion which allows it to more reliably identify non-homologous sequences than these other programs, although this increased accuracy comes at the cost of decreased speed.', 2 => 'FSA is currently being used for multiple projects, including sequencing new worm genomes and analyzing ''in vivo'' transcription factor binding in flies.', 3 => 'The algorithm starts first by determining [[posterior probabilities]] of alignment <math>\mathbb{P}(A|X, Y)</math> between any two random sequences from the pool of sequences being aligned. The posterior probabilities for each column reinforce the prediction of alignment probability between a sequence pair and also filter out columns that can be unreliably aligned. These probabilities also allow for the prediction and estimate of homology between any sequence pair. A standard five-state pair [[hidden Markov model]] (Pair HMM) is used to determine these posterior probabilities of alignment for any two input sequences. The Pair HMM model uses two sets of Delete (D) and Insert (I) states to account for symbol deletion and insertions between two aligned sequences, but it can also have three states without a significant loss of accuracy.', 4 => 'Since the number of pair-wise comparisons needed to determine the posterior probability distributions of any two pairs of sequences is computationally expensive and quadratic in the amount of sequences that are being aligned, it is decreased by using a randomized approach inspired by the Erdos-Renyi theory of random graphs. This significantly reduces the runtimes of datasets and the computational cost of running the multiple alignments.', 5 => 'Most existing programs that run multiple sequence alignment algorithms are based on progressive alignment where the process starts with a "null alignment," a state where none of the sequences have been aligned. The pool of sequences is then aligned either through pairwise comparisons or through an alignment of a pair of partial alignments of subsequences. This process can cause issues in alignment because the resulting multiple sequence alignment can and will heavily depend on the sequences that are aligned at the start. There is no realigning of previously aligned sequences that could correct the MSA. ', 6 => 'FSA uses the sequence annealing technique to overcome this issue. The sorted posterior probabilities are used with the sequence annealing technique to generate a multiple alignment. The technique finds the alignment between two sequences that minimizes the expected distance to the truth. In this case, the distance between two sequences is the number of columns in which the character from one sequence is not homologous to the character in the same column in the second sequence. ', 7 => 'The sequence annealing technique, by determining an alignment with the minimum expected distance to the truth, conversely finds the alignment with the maximum expected accuracy. The accuracy of an alignment depends on a "true" alignment as reference and indicates the fraction of columns where the sequences are homologous. This accuracy is then used as an objective function that starts with the unaligned sequences (null alignment) and aligns characters in different columns based on the increasing accuracy of an alignment. ', 8 => 'FSA aligns multiple sequences based on homology within columns instead of strictly a consideration of indels and substitutions. As such, FSA considers alignments to be equivalent if for every position along the sequences in both alignments, the same statement about homology can be made. For example, when considering pairwise comparisons, if there is a gap at a specific position in two alignments, then it can be said that the two sequences being compared are not homologous at said position. This can result in alignments where gap-open events can differ and yet still be considered equivalent. As such, FSA chooses to output the alignment in which there is a minimum amount of "gap openings".', 9 => 'To handle overly large datasets, FSA is able to divide the work of running all necessary pairwise comparisons and alignments to different processors. This is handled by using a "fixed-size chunking" strategy that distributes the pairwise comparisons to each available processor in chunks. Each processor is therefore able to run the posterior probability calculation on a chunk of pairwise comparisons before merging the collected data back to a single processor for sequence annealing.', 10 => '== Comparisons to other programs ==' ]
Lines removed in edit (removed_lines)
[ 0 => '{{Infobox Software|name=FSA|developer=Robert Bradley ([[UC Berkeley]]), Colin Dewey ([[UW Madison]]), [[Lior Pachter]] ([[UC Berkeley]])|latest_release_version=1.5.2|operating_system=[[UNIX]], [[Linux]], [[Apple Macintosh|Mac]]|genre=Bioinformatics tool|licence=Open source}}', 1 => ''''Fast statistical alignment''' or '''FSA''' is a [[multiple sequence alignment]] program for aligning many proteins or RNAs or long genomic DNA sequences. Along with [[MUSCLE (alignment software)|MUSCLE]] and [[MAFFT]], FSA is one of the few sequence alignment programs which can align datasets of hundreds or thousands of sequences. FSA uses a different optimization criterion which allows it to more reliably identify non-homologous sequences than these other programs, although this increased accuracy comes at the cost of decreased speed.', 2 => 'FSA is currently being used for projects including sequencing new worm genomes and analyzing ''in vivo'' transcription factor binding in flies.', 3 => 'The algorithm starts first by determining [[posterior probabilities]] of alignment <math>\mathbb{P}(A|X, Y)</math> between any two random sequences from the pool of sequences being aligned. The posterior probabilities for each column reinforce the prediction of alignment probability between a sequence pair and also filter out columns that can be unreliably aligned. These probabilities also allow for the prediction and estimate of homology between any sequence pair. A standard five-state pair [[hidden Markov model]] (Pair HMM) is used to determine these posterior probabilities of alignment for any two input sequences. The Pair HMM model uses two sets of Delete (D) and Insert (I) states to account for symbol deletion and insertions between two aligned sequences, but it can also have three states without a significant loss of accuracy.', 4 => 'Since the number of pairwise comparisons needed to determine the posterior probability distributions of any two pairs of sequences is computationally expensive and quadratic in the amount of sequences that are being aligned, it is decreased by using a randomized approach inspired by the Erdos-Renyi theory of random graphs. This significantly reduces the runtimes of datasets and the computational cost of running the multiple alignments.', 5 => 'Most existing programs that run multiple sequence alignment algorithms are based on progressive alignment where the process starts with a “null alignment,” a state where none of the sequences have been aligned. The pool of sequences is then aligned either through pairwise comparisons or through an alignment of a pair of partial alignments of subsequences. This process can cause issues in alignment because the resulting multiple sequence alignment can and will heavily depend on the sequences that are aligned at the start. There is no realigning of previously aligned sequences that could correct the MSA. ', 6 => 'FSA uses the sequence annealing technique to overcome this issue. The sorted posterior probabilities are used with the sequence annealing technique to generate a multiple alignment. The technique finds the alignment between two sequences that minimizes the expected distance to the truth. In this case, the distance between two sequences is the number of columns in which the character from one sequence is not homologous to the character in the same column in the second sequence. ', 7 => 'The sequence annealing technique, by determining an alignment with the minimum expected distance to the truth, conversely finds the alignment with the maximum expected accuracy. The accuracy of an alignment depends on a “true” alignment as reference and indicates the fraction of columns where the sequences are homologous. This accuracy is then used as an objective function that starts with the unaligned sequences (null alignment) and aligns characters in different columns based on the increasing accuracy of an alignment. ', 8 => 'FSA aligns multiple sequences based on homology within columns instead of strictly a consideration of indels and substitutions. As such, FSA considers alignments to be equivalent if for every position along the sequences in both alignments, the same statement about homology can be made. For example, when considering pairwise comparisons, if there is a gap at a specific position in two alignments, then it can be said that the two sequences being compared are not homologous at said position. This can result in alignments where gap-open events can differ and yet still be considered equivalent. As such, FSA chooses to output the alignment in which there is a minimum amount of “gap openings.”', 9 => 'To handle overly large datasets, FSA is able to divide the work of running all necessary pairwise comparisons and alignments to different processors. This is handled by using a “fixed-size chunking” strategy that distributes the pairwise comparisons to each available processor in chunks. Each processor is therefore able to run the posterior probability calculation on a chunk of pairwise comparisons before merging the collected data back to a single processor for sequence annealing.', 10 => '== Comparisons to other alignment programs ==' ]
Whether or not the change was made through a Tor exit node (tor_exit_node)
false
Unix timestamp of change (timestamp)
'1718666906'