What are the aspects of data mining for Biological Data Analysis?

The main aspects of data mining for biological data analysis are as follows −

Semantic integration of heterogeneous, distributed genomic and proteomic databases − Genomic and proteomic data sets are generated at multiple labs and by various methods. They are distributed, heterogeneous, and highly varied. The semantic integration of such data is essential to the cross-site analysis of biological data.

Furthermore, it is essential to find correct linkages between the research literature and the biological entities it describes. Such integration and linkage analysis can support the systematic and coordinated analysis of genomic and biological data. This need has promoted the development of integrated data warehouses and distributed federated databases to store and manage primary and derived biological data.

Data cleaning, data integration, reference reconciliation, classification, and clustering methods can support the integration of biological data and the construction of data warehouses for biological data analysis.
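As a rough illustration of data cleaning and reference reconciliation, the following Python sketch merges measurements from two sources whose gene labels are written differently. The synonym table, the lab names, the numeric values, and the functions `canonical` and `integrate` are all hypothetical and chosen only for this example. A real integration effort would rely on curated identifier mappings rather than a hand-written dictionary.

```python
# Hypothetical synonym table: maps lab-specific labels to one canonical symbol.
SYNONYMS = {
    "p53": "TP53",
    "tp53": "TP53",
    "brca-1": "BRCA1",
    "brca1": "BRCA1",
}

def canonical(label):
    """Clean a raw label and resolve it to a canonical symbol."""
    key = label.strip().lower()              # data cleaning: trim and normalize case
    # Reference reconciliation: unknown labels fall back to the upper-cased text.
    return SYNONYMS.get(key, label.strip().upper())

def integrate(*sources):
    """Merge records from several sources, grouped by canonical gene symbol."""
    merged = {}
    for source_name, records in sources:
        for label, value in records:
            merged.setdefault(canonical(label), []).append((source_name, value))
    return merged

# Toy records (made-up values) from two hypothetical labs.
lab_a = ("lab_a", [(" p53 ", 2.1), ("BRCA-1", 0.7)])
lab_b = ("lab_b", [("TP53", 1.9), ("brca1", 0.9), ("myc", 3.2)])

warehouse = integrate(lab_a, lab_b)
for gene, measurements in sorted(warehouse.items()):
    print(gene, measurements)
```

After reconciliation, records that refer to the same gene under different labels end up under a single key, which is the property a warehouse needs before any cross-site analysis.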

Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide/protein sequences − Various biological sequence alignment methods have been developed in the past two decades. BLAST and FASTA, in particular, are widely used tools for the systematic analysis of genomic and proteomic data. Biological sequence analysis methods differ from many of the sequential pattern analysis algorithms proposed in data mining research.

Biological sequence analysis methods should allow for gaps and mismatches between a query sequence and the sequence data being searched, in order to handle insertions, deletions, and mutations. Furthermore, for protein sequences, two amino acids should also be treated as a “match” if one can be derived from the other by substitutions that are likely to occur in nature.
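The idea of tolerating gaps, mismatches, and likely substitutions can be sketched with a small global alignment scorer based on Needleman-Wunsch dynamic programming. This is not how BLAST or FASTA are implemented, since both use faster heuristic search. The similarity groups and the score values below are assumptions made for illustration and do not represent a real substitution matrix such as BLOSUM or PAM.

```python
# Toy groups of amino acids treated as interchangeable (an assumption for
# illustration, not a real substitution matrix).
SIMILAR_GROUPS = [set("ILVM"), set("DE"), set("KR"), set("ST")]

def score_pair(a, b, match=2, similar=1, mismatch=-1):
    """Score one aligned pair: identical, likely substitution, or mismatch."""
    if a == b:
        return match
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return similar
    return mismatch

def global_alignment_score(s, t, gap=-2):
    """Return the best global alignment score of s and t (Needleman-Wunsch)."""
    rows, cols = len(s) + 1, len(t) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap                   # s prefix aligned against gaps only
    for j in range(1, cols):
        dp[0][j] = j * gap                   # t prefix aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score_pair(s[i - 1], t[j - 1]),  # align pair
                dp[i - 1][j] + gap,          # gap in t (deletion)
                dp[i][j - 1] + gap,          # gap in s (insertion)
            )
    return dp[-1][-1]

print(global_alignment_score("ACD", "ACE"))  # D and E count as a partial match
print(global_alignment_score("ACD", "AD"))   # one gap is needed
```

The gap penalty handles insertions and deletions, while the `similar` score lets two different amino acids contribute positively when one is a plausible substitute for the other.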

Discovery of structural patterns and analysis of genetic networks and protein pathways − In biology, protein sequences are folded into three-dimensional structures, and such structures interact with each other based on their relative positions and the distances between them. Such complex interactions form the basis of sophisticated genetic networks and protein pathways.

It is crucial to discover structural patterns and regularities among such huge and complex biological networks. This calls for powerful and scalable data mining methods that can discover approximate and frequent structural patterns and study the regularities and irregularities among such interconnected biological networks.
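As a very small starting point for frequent structural pattern discovery, the sketch below counts how many networks in a collection contain each interaction (edge) and keeps those that meet a minimum support. The protein names, the networks, and the function `frequent_edges` are hypothetical. Single edges are the simplest possible pattern; practical frequent subgraph mining extends this idea to larger patterns and must also handle subgraph isomorphism and approximate matches, which this sketch does not attempt.

```python
from collections import Counter

def frequent_edges(networks, min_support):
    """Return undirected edges that occur in at least min_support networks.

    Assumes each edge connects two distinct nodes (no self-interactions).
    """
    counts = Counter()
    for edges in networks:
        # frozenset makes (a, b) and (b, a) the same edge; the set ensures
        # each network contributes at most one count per edge.
        unique_edges = {frozenset(edge) for edge in edges}
        counts.update(unique_edges)
    return {
        tuple(sorted(edge)): count
        for edge, count in counts.items()
        if count >= min_support
    }

# Three toy interaction networks over hypothetical proteins P1..P4.
networks = [
    [("P1", "P2"), ("P2", "P3")],
    [("P2", "P1"), ("P3", "P4")],
    [("P1", "P2"), ("P2", "P3"), ("P3", "P4")],
]

print(frequent_edges(networks, min_support=2))
```

Raising `min_support` narrows the result to interactions that recur across more of the networks, which is the basic notion of a regularity in this setting.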

Association and path analysis − Association and path analysis can identify co-occurring gene sequences and link genes to different stages of disease development. Association analysis methods can be used to help determine which kinds of genes are likely to co-occur in target samples. Such analysis would support the discovery of groups of genes and the study of the interactions and relationships among them.
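A minimal sketch of association analysis over gene sets is shown below. It counts how often pairs of genes appear together across samples and reports support and confidence for pairs above a threshold. The gene names, the samples, and the function `cooccurring_pairs` are hypothetical, and only pairs are considered; an algorithm such as Apriori or FP-growth would be needed to find larger gene groups efficiently.

```python
from collections import Counter
from itertools import combinations

def cooccurring_pairs(samples, min_support):
    """Find gene pairs whose co-occurrence support is at least min_support."""
    n = len(samples)
    gene_counts = Counter()
    pair_counts = Counter()
    for genes in samples:
        genes = sorted(set(genes))           # de-duplicate, fix pair ordering
        gene_counts.update(genes)
        pair_counts.update(combinations(genes, 2))
    results = {}
    for (a, b), count in pair_counts.items():
        support = count / n                  # fraction of samples with both genes
        if support >= min_support:
            results[(a, b)] = {
                "support": support,
                "confidence_a_to_b": count / gene_counts[a],
                "confidence_b_to_a": count / gene_counts[b],
            }
    return results

# Each toy sample lists the genes observed as active (hypothetical data).
samples = [
    {"g1", "g2", "g3"},
    {"g1", "g2"},
    {"g2", "g3"},
    {"g1", "g2", "g4"},
]

for pair, stats in sorted(cooccurring_pairs(samples, min_support=0.5).items()):
    print(pair, stats)
```

Support measures how common a pair is overall, while the two confidence values indicate how strongly the presence of one gene suggests the presence of the other.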