Workflow for Exact Sequence Alignment on High-Performance Computing Systems

Rafael Terra; Kelen Souza; Hiago Rocha; Carla Osthoff; Diego Carvalho; Kary Ocaña

doi:10.5753/sscad.2025.16727

Rafael Terra (LNCC)
Kelen Souza (LNCC / FAETERJ)
Hiago Rocha (LNCC)
Carla Osthoff (LNCC)
Diego Carvalho (CEFET-RJ)
Kary Ocaña (LNCC)

Abstract

Multiple Sequence Alignment (MSA) is a fundamental step in molecular evolutionary biology, with direct impact on the identification of markers associated with genetic and infectious diseases. Alignment quality determines the reliability of the biological interpretations. However, exact MSA algorithms are limited in the number of sequences they can process, even in High-Performance Computing (HPC) environments, because the problem is NP-hard. As a result, the usual practice is to manually select reduced subsets of sequences. In this work, we propose and evaluate a scientific workflow in HPC for MSA with exact algorithms that incorporates the automatic selection of a representative subset of sequences, so as to work around the restrictions of the available tools. Experiments with implementations of the workflow using PyCOMPSs and Shell scripts showed gains of 2.08× and 3.95×, respectively, over the sequential execution of the tasks. In particular, the PyCOMPSs implementation showed better scalability, reaching a gain of 80.8% when aligning 38 sequences.
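To make the scaling limitation concrete, the following is a minimal sketch that counts the cells of the full dynamic-programming lattice for an exact alignment. It assumes the textbook formulation, in which aligning k sequences requires one lattice dimension per sequence; this is an illustration only and not necessarily the algorithm used in the workflow. The function name `dp_cells` and the sequence length of 100 are hypothetical choices.

```python
def dp_cells(lengths):
    """Return the number of cells in the full dynamic-programming lattice.

    Assumption: textbook exact alignment, where each sequence of length n
    contributes a dimension of size n + 1 (the extra slot is the empty prefix).
    """
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells


# Hypothetical example: k sequences, each 100 residues long.
for k in (2, 3, 4, 5):
    print(f"{k} sequences: {dp_cells([100] * k)} cells")
```

The cell count is the product of the sequence lengths plus one, so it grows exponentially with the number of sequences. Under this assumption, adding a single sequence of length 100 multiplies the lattice size by 101, which is why selecting a reduced subset of sequences matters for exact methods.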
