Sequence Identification

Genome-wide identification of putative lectin genes in plants using BLASTp, HMMER, InterProScan, and Phytozome across 12 lectin families.
Author

Beaven Manjengwa

Published

May 1, 2026

Keywords

plant lectins, genome-wide identification, BLASTp, HMMER, InterProScan, Phytozome, lectin families, Pfam domains

Objective of this section

Identify all putative lectin-encoding genes in a plant genome by searching for sequence similarity to reference model sequences and annotated Pfam lectin domains. The accuracy of this step is critical because all subsequent analyses build on it.

Reference model sequences and Pfam domains

The 12 plant lectin families recognized in the literature [1], their reference model sequences, and the Pfam domains used for identification are listed below.

| Lectin Family | Model Organism | Lectin Domain | Pfam ID | Accession |
|---|---|---|---|---|
| ABA | Agaricus bisporus | FB_lectin | PF07367 | Q00022 |
| Amaranthin | Amaranthus caudatus | Agglutinin | PF07468 | AAL05954.1 |
| CRA | Robinia pseudoacacia | - | - | ABL98074.1 |
| Cyanovirin | Nostoc ellipsosporum | CVNH | PF08881 | P81180 |
| EUL | Euonymus europaeus | - | - | ABW73993.1 |
| GNA | Galanthus nivalis | B_lectin | PF01453 | P30617 |
| Hevein | Hevea brasiliensis | Chitin_bind_1 | PF00187 | P02877 |
| JRL | Artocarpus integer | Jacalin | PF01419 | AAA32680.1 |
| Legume lectin | Glycine max | Lectin_legB | PF00139 | P05046 |
| LysM | Brassica juncea | LysM | PF01476 | BAN83772.1 |
| Nictaba | Nicotiana tabacum | PP2 | PF14299 | AAK84134.1 |
| Ricin B | Ricinus communis | Ricin_B_lectin | PF00652 | 2AAI_B |
Table 1: Reference model sequences and Pfam domains for the 12 plant lectin families. ABA: Agaricus bisporus agglutinin; CRA: chitinase-related agglutinin; EUL: Euonymus lectin; GNA: Galanthus nivalis agglutinin; JRL: Jacalin-related lectin; LysM: Lysin motif. CRA and EUL have no Pfam domain.

Tools and databases

| Tool / Database | Tested Version | Platform | Purpose |
|---|---|---|---|
| BLAST+ (blastp) | 2.15.0 | Linux/Windows/macOS | Homology-based lectin identification |
| HMMER (hmmsearch) | 3.4 | Linux | Profile-based Pfam domain searching |
| InterProScan | 5.77-108.0 | Linux | Predicting domains and important sites |
| NCBI | - | Web | Genome assembly and annotation download |
| Ensembl Plants | - | Web | Genome assembly and annotation download |
| Phytozome | 14 | Web | Public repository of plant genomic resources |
| InterPro and/or Pfam [2] | - | Web | Lectin domain HMM profiles |
| MapChart | 2.32 | Windows | Graphical presentation of linkage maps and QTLs |
| MG2C [3] | 2.1 | Web | Online chromosomal map construction and visualization |
Table 2: Tools and databases that can be used in this analysis.

Phytozome-based Identification Workflow

Genome assemblies and predicted proteomes of available plant species are accessed through Phytozome [4], a public repository of plant genomic resources.

Note

Some plant species have standalone databases for genome assemblies and annotations. Most of these also provide BLASTp searches directly, as Phytozome does.

Step 1. Each of the 12 model sequences from Table 1 is submitted as a separate BLASTp query against the target species’ predicted proteome using Phytozome-BLASTp.

Tip

A permissive E-value threshold is recommended because the model sequences come from distantly related organisms and family members in the target genome may be divergent.

BLOSUM62 is the preferred comparison matrix for this search. Default word length is appropriate and does not need adjustment.
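When the proteome has been downloaded and BLAST+ (Table 2) is installed locally, the same search settings could be scripted instead of entered in the web form. The sketch below only assembles a blastp command line. The file names, the database name, and the E-value of 1.0 are illustrative assumptions, not values prescribed by this protocol. The protein database is assumed to have been built beforehand with makeblastdb.

```python
import subprocess


def build_blastp_command(query_fasta, db_name, out_tsv, evalue=1.0):
    """Return a blastp argument list for one model-sequence query.

    evalue=1.0 is only an illustrative "permissive" cutoff (assumption);
    pick a value suited to the target genome.
    """
    return [
        "blastp",
        "-query", query_fasta,
        "-db", db_name,          # protein database built with makeblastdb
        "-evalue", str(evalue),
        "-matrix", "BLOSUM62",   # comparison matrix named in the text
        "-outfmt", "6",          # 12-column tabular output
        "-out", out_tsv,
        # word length is left at its default, as the text recommends
    ]


def run_blastp(query_fasta, db_name, out_tsv, evalue=1.0):
    """Run the search; requires the BLAST+ binaries on PATH."""
    subprocess.run(
        build_blastp_command(query_fasta, db_name, out_tsv, evalue),
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the call.
    print(" ".join(build_blastp_command(
        "GNA_model.fasta", "target_proteome", "GNA_hits.tsv")))
```

Running one command per model sequence keeps the hits for each lectin family in a separate output file, which mirrors the one-query-per-family design of Step 1.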

Step 2. Sequences with the highest identity percentage from Step 1 are selected and used as queries in a second BLASTp search against the same assembly. This improves detection of divergent family members that the cross-species model sequences may miss.
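If the first-round hits are available as BLAST tabular output (the 12-column format produced by `-outfmt 6`), choosing the second-round queries can be sketched as below. The E-value cutoff, the number of hits kept per query, and the gene identifiers in the example are hypothetical; the protocol itself does not fix these values.

```python
from collections import defaultdict


def parse_outfmt6(lines):
    """Yield (query, subject, percent_identity, evalue) from tabular rows."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        yield fields[0], fields[1], float(fields[2]), float(fields[10])


def top_identity_hits(lines, max_evalue=1.0, top_n=3):
    """Per query, return up to top_n subject IDs ranked by percent identity."""
    best = defaultdict(dict)  # query -> {subject: highest percent identity}
    for query, subject, pident, evalue in parse_outfmt6(lines):
        if evalue <= max_evalue:
            previous = best[query].get(subject, 0.0)
            best[query][subject] = max(pident, previous)
    return {
        query: sorted(subjects, key=lambda s: (-subjects[s], s))[:top_n]
        for query, subjects in best.items()
    }


# Hypothetical first-round hits for the GNA model sequence (P30617).
example_rows = [
    "P30617\tGeneA.1\t45.2\t150\t70\t3\t1\t150\t20\t170\t1e-30\t120",
    "P30617\tGeneB.1\t38.0\t140\t80\t4\t5\t145\t30\t172\t2e-12\t70",
    "P30617\tGeneC.1\t29.5\t90\t60\t2\t10\t100\t5\t95\t5.0\t25",
]
second_round_queries = top_identity_hits(example_rows, max_evalue=1.0, top_n=2)
```

The selected identifiers are species-specific sequences, so using them as queries searches the assembly with sequences that are closer to its own family members than the cross-species models are.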

Step 3. All candidate sequences identified across both rounds are retrieved using the Phytozome-BioMart tool. Structural annotation data available for download include gene name, transcript name, chromosome and transcript coordinates, and gene description.
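Where the predicted proteome is available as a local FASTA file, the candidate sequences from both BLASTp rounds could also be pulled out with a short script rather than through BioMart. This is a minimal sketch of that alternative, not the Phytozome procedure itself. It assumes the BLAST subject IDs match the first whitespace-delimited token of each FASTA header, and all identifiers shown are hypothetical.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines.

    The identifier is the header text up to the first whitespace.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def extract_candidates(fasta_lines, round1_ids, round2_ids):
    """Return {identifier: sequence} for the union of both BLASTp rounds."""
    wanted = set(round1_ids) | set(round2_ids)  # duplicates collapse here
    return {
        name: seq for name, seq in read_fasta(fasta_lines) if name in wanted
    }


# Hypothetical proteome excerpt and candidate lists.
proteome = [
    ">GeneA.1 some description",
    "MKT",
    "AAL",
    ">GeneB.1",
    "MGG",
    ">GeneC.1",
    "MPP",
]
candidates = extract_candidates(proteome, ["GeneA.1"], ["GeneA.1", "GeneC.1"])
```

Taking the set union before extraction means a gene found in both rounds is written out only once.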

Step 4. Retrieved sequences are submitted to the InterPro web service for scanning against the InterPro protein signature databases, including Pfam, to predict domains and important sites.

Note

Proteins with at least one lectin domain can be considered putative lectins for downstream analysis. Many identified lectin genes encode chimeric proteins in which the lectin domain is fused to additional functional domains, such as protein kinase and F-box domains.
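The rule in this note can be expressed as a small classification step once the Pfam IDs predicted for each protein are known. The sketch below uses the Pfam IDs from Table 1. The function name and the example inputs are hypothetical, and PF00069 (the Pfam protein kinase domain) is used only to illustrate a chimeric protein.

```python
# Pfam IDs from Table 1 (CRA and EUL have none and are handled by BLASTp).
LECTIN_PFAM = {
    "PF07367": "ABA",
    "PF07468": "Amaranthin",
    "PF08881": "Cyanovirin",
    "PF01453": "GNA",
    "PF00187": "Hevein",
    "PF01419": "JRL",
    "PF00139": "Legume lectin",
    "PF01476": "LysM",
    "PF14299": "Nictaba",
    "PF00652": "Ricin B",
}


def classify_protein(pfam_ids):
    """Return (lectin_families, other_domains, is_chimeric) for one protein.

    A protein is a putative lectin if lectin_families is non-empty, and it
    is flagged chimeric if it also carries at least one non-lectin domain.
    """
    ids = set(pfam_ids)
    families = sorted({LECTIN_PFAM[p] for p in ids if p in LECTIN_PFAM})
    others = sorted(ids - set(LECTIN_PFAM))
    return families, others, bool(families) and bool(others)


# Hypothetical protein carrying a GNA domain and a protein kinase domain.
gna_kinase = classify_protein(["PF01453", "PF00069"])
```

Proteins for which `lectin_families` is empty would be dropped from the candidate list under this rule.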

Local InterProScan Workflow

When InterProScan is installed on a local server, the complete predicted proteome can be scanned in a single run and results filtered against the Pfam IDs in Table 1.

Step 1. Run InterProScan against the complete predicted proteome of the target species.

Note

InterProScan applies gathering thresholds by default, ensuring only statistically significant domain matches are reported.

Step 2. Filter the output by the Pfam IDs listed in Table 1 to extract all sequences with a lectin domain.
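Assuming the scan was written in InterProScan's tab-separated (TSV) format, where column 1 holds the protein accession, column 4 the member database, and column 5 the signature accession, the filtering step might look like the sketch below. The example rows are hypothetical and abbreviated, with placeholder values in the columns that are not used.

```python
# Pfam IDs from Table 1.
LECTIN_PFAM_IDS = {
    "PF07367", "PF07468", "PF08881", "PF01453", "PF00187",
    "PF01419", "PF00139", "PF01476", "PF14299", "PF00652",
}


def lectin_hits_from_tsv(tsv_lines):
    """Map protein ID -> sorted lectin Pfam IDs found in InterProScan TSV rows."""
    hits = {}
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed rows
        protein_id, analysis, signature = fields[0], fields[3], fields[4]
        if analysis == "Pfam" and signature in LECTIN_PFAM_IDS:
            hits.setdefault(protein_id, set()).add(signature)
    return {protein: sorted(ids) for protein, ids in hits.items()}


# Hypothetical rows: protein, MD5, length, analysis, signature, description,
# start, stop, score, status, date.
example_tsv = [
    "GeneA.1\tmd5a\t420\tPfam\tPF01453\tdesc\t30\t130\t1.2E-20\tT\tdate",
    "GeneA.1\tmd5a\t420\tPfam\tPF00069\tdesc\t200\t410\t3.0E-40\tT\tdate",
    "GeneB.1\tmd5b\t310\tSMART\tSM00108\tdesc\t25\t140\t4.5E-10\tT\tdate",
    "GeneC.1\tmd5c\t280\tPfam\tPF00139\tdesc\t28\t250\t8.1E-60\tT\tdate",
]
lectin_hits = lectin_hits_from_tsv(example_tsv)
```

Restricting the match to rows where the analysis is `Pfam` avoids counting the same domain twice when other member databases report an overlapping signature.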

Step 3. For CRA and EUL, which have no Pfam domain, a separate BLASTp search using their model sequences (Table 1) is still required.
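The Pfam-based and BLASTp-based results then need to be combined into one candidate set per family. A minimal sketch of that merge is shown below, assuming both result sets have already been reduced to lists of protein IDs per family. The function names and identifiers are hypothetical.

```python
# The 12 lectin families from Table 1.
ALL_FAMILIES = [
    "ABA", "Amaranthin", "CRA", "Cyanovirin", "EUL", "GNA",
    "Hevein", "JRL", "Legume lectin", "LysM", "Nictaba", "Ricin B",
]


def merge_by_family(pfam_hits, blast_hits):
    """Combine {family: protein IDs} from both searches into sets per family."""
    merged = {family: set() for family in ALL_FAMILIES}
    for source in (pfam_hits, blast_hits):
        for family, protein_ids in source.items():
            merged.setdefault(family, set()).update(protein_ids)
    return merged


def undetected_families(merged):
    """List the families with no candidate in the target genome."""
    return [family for family in ALL_FAMILIES if not merged[family]]


# Hypothetical results: Pfam filtering plus the BLASTp search for EUL.
pfam_hits = {"GNA": ["GeneA.1", "GeneB.1"], "Legume lectin": ["GeneC.1"]}
blast_hits = {"EUL": ["GeneD.1"], "GNA": ["GeneA.1"]}
merged = merge_by_family(pfam_hits, blast_hits)
missing = undetected_families(merged)
```

Storing the IDs as sets means a protein recovered by both approaches is counted once per family.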

Key Insights

  • Not all 12 lectin families are detected in every plant genome. The ABA, Amaranthin, and Cyanovirin families were not detected in several species, including Arabidopsis, Phaseolus species, and cucumber.
  • The Legume lectin and GNA families are usually the most abundant in plant genomes, though their relative numbers vary by species.
  • Most identified lectin genes encode chimeric proteins fused with additional functional domains.

Key Limitations

  • Genome assembly and annotation quality directly affects the number of sequences retrieved. Fragmented assemblies and incomplete annotations can lead to underestimation of lectin gene numbers.
  • No single database covers all 12 families equally; Pfam annotation is incomplete for several non-model lectin families.

Published Studies

Phaseolus species [5], Arabidopsis thaliana [6], cucumber (Cucumis sativus) [7], rice (Oryza sativa) [8], soybean (Glycine max) [9], and sorghum (Sorghum bicolor) [10]

References

1. Van Damme, E. J. M., Lannoo, N. & Peumans, W. J. Plant lectins. in Advances in Botanical Research vol. 48 107–209 (Academic Press, 2008).
2. Blum, M. et al. InterPro: The protein sequence classification resource in 2025. Nucleic Acids Research 53, D444–D456 (2024).
3. Chao, J. et al. MG2C: A user-friendly online tool for drawing genetic maps. Molecular Horticulture 1, (2021).
4. Goodstein, D. M. et al. Phytozome: A comparative platform for green plant genomics. Nucleic Acids Research 40, D1178–D1186 (2012).
5.
6. Eggermont, L., Verstraeten, B. & Van Damme, E. J. M. Genome-wide screening for lectin motifs in Arabidopsis thaliana. The Plant Genome 10, plantgenome2017.02.0010 (2017).
7. Dang, L. & Van Damme, E. J. M. Genome-wide identification and domain organization of lectin domains in cucumber. Plant Physiology and Biochemistry 108, 165–176 (2016).
8. Tsaneva, M., De Schutter, K., Verstraeten, B. & Van Damme, E. J. M. Lectin sequence distribution in QTLs from rice (Oryza sativa) suggest a role in morphological traits and stress responses. International Journal of Molecular Sciences 20, 437 (2019).
9. Van Holle, S. & Van Damme, E. Distribution and evolution of the lectin family in soybean (Glycine max). Molecules 20, 2868–2891 (2015).
10.