Sequence Identification

Genome-wide identification of putative lectin genes in plants using BLASTp, HMMER, InterProScan, and Phytozome across 12 lectin families.

Author

Beaven Manjengwa

Published

May 1, 2026

Keywords

plant lectins, genome-wide identification, BLASTp, HMMER, InterProScan, Phytozome, lectin families, Pfam domains

Objective of this section

Identify all putative lectin-encoding genes in a plant genome by searching for sequence similarity to reference model sequences and for annotated Pfam lectin domains. The accuracy of this step is critical because all subsequent analyses build on it.

Reference model sequences and Pfam domains

The 12 plant lectin families recognized in the literature¹, their reference model sequences, and the Pfam domains used for identification are listed below.

Lectin Family	Model Organism	Lectin Domain	Pfam ID	Accession
ABA	Agaricus bisporus	FB_lectin	PF07367	Q00022
Amaranthin	Amaranthus caudatus	Agglutinin	PF07468	AAL05954.1
CRA	Robinia pseudoacacia	—	—	ABL98074.1
Cyanovirin	Nostoc ellipsosporum	CVNH	PF08881	P81180
EUL	Euonymus europaeus	—	—	ABW73993.1
GNA	Galanthus nivalis	B_lectin	PF01453	P30617
Hevein	Hevea brasiliensis	Chitin_bind_1	PF00187	P02877
JRL	Artocarpus integer	Jacalin	PF01419	AAA32680.1
Legume lectin	Glycine max	Lectin_legB	PF00139	P05046
LysM	Brassica juncea	LysM	PF01476	BAN83772.1
Nictaba	Nicotiana tabacum	PP2	PF14299	AAK84134.1
Ricin B	Ricinus communis	Ricin_B_lectin	PF00652	2AAI_B

Table 1: Reference model sequences and Pfam domains for the 12 plant lectin families. ABA: Agaricus bisporus agglutinin; CRA: chitinase-related agglutinin; EUL: Euonymus europaeus lectin; GNA: Galanthus nivalis agglutinin; JRL: jacalin-related lectin; LysM: lysin motif. A dash indicates that no Pfam domain is assigned to the family.

Tools and databases

Tool / Database	Tested Version	Platform	Purpose
BLAST+ (blastp)	2.15.0	Linux/Windows/macOS	Homology-based lectin identification
HMMER (hmmsearch)	3.4	Linux	Profile-based Pfam domain searching
InterProScan	5.77-108.0	Linux	Predicting domains and important sites
NCBI	—	Web	Genome assembly and annotation download
Ensembl Plants	—	Web	Genome assembly and annotation download
Phytozome	14	Web	Public repository of plant genomic resources
InterPro and/or Pfam²	—	Web	Lectin domain HMM profiles
MapChart	2.32	Windows	Graphical presentation of linkage maps and QTLs
MG2C³	2.1	Web	Online chromosomal map construction and visualization

Table 2: Tools and databases that can be used in this analysis.

Phytozome-based Identification Workflow

Genome assemblies and predicted proteomes of available plant species are accessed through Phytozome⁴, a public repository of plant genomic resources.

Note

Some plant species have standalone databases for genome assemblies and annotations, most of which offer their own BLASTp search, comparable to Phytozome-BLASTp.

Step 1. Each of the 12 model sequences from Table 1 is submitted as a separate BLASTp query against the target species’ predicted proteome using Phytozome-BLASTp.

Tip

A permissive E-value threshold is recommended because the model sequences come from distantly related organisms and family members in the target genome may be highly divergent.

BLOSUM62 is the preferred comparison matrix for this search. Default word length is appropriate and does not need adjustment.
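For readers who prefer to run the equivalent search locally with BLAST+ instead of Phytozome-BLASTp, the settings above could be expressed as in the minimal sketch below. The file names, the database prefix, and the E-value of 1e-3 are illustrative assumptions, not values taken from this protocol; the word size is deliberately not set so that the BLAST+ default applies.

```python
import shlex


def build_blastp_command(query_fasta, db_prefix, out_tsv, evalue=1e-3):
    """Return a blastp argument list using BLOSUM62 and tabular output.

    The default evalue is a placeholder for a "permissive" threshold;
    choose a value appropriate for the target genome.
    """
    return [
        "blastp",
        "-query", query_fasta,    # one model sequence from Table 1
        "-db", db_prefix,         # protein database built from the proteome
        "-matrix", "BLOSUM62",
        "-evalue", str(evalue),
        "-outfmt", "6",           # tab-separated hit table
        "-out", out_tsv,
    ]


# Hypothetical file names for the GNA model sequence (P30617).
cmd = build_blastp_command("GNA_P30617.fasta", "target_proteome", "GNA_hits.tsv")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once a protein database has been built from the predicted proteome, for example with `makeblastdb -in proteome.fasta -dbtype prot -out target_proteome`.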

Step 2. Sequences with the highest identity percentage from Step 1 are selected and used as queries in a second BLASTp search against the same assembly. This improves detection of divergent family members that the cross-species model sequences may miss.
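If the first-round results are available as a BLAST tabular file (`-outfmt 6`), choosing the second-round queries could look like the sketch below. It assumes the default column order of that format (query, subject, percent identity, and so on); the hit rows and protein names are invented toy data, and `per_query` is a hypothetical parameter.

```python
def top_identity_hits(blast_tsv_text, per_query=1):
    """Return {query_id: [subject_id, ...]} with the highest-identity subjects first."""
    best = {}
    for line in blast_tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3 or line.startswith("#"):
            continue
        query, subject, pident = fields[0], fields[1], float(fields[2])
        hits = best.setdefault(query, {})
        # A subject may appear in several alignments; keep its best identity.
        hits[subject] = max(pident, hits.get(subject, 0.0))
    selected = {}
    for query, hits in best.items():
        ranked = sorted(hits.items(), key=lambda item: item[1], reverse=True)
        selected[query] = [subject for subject, _ in ranked[:per_query]]
    return selected


# Toy rows: qseqid, sseqid, pident, length, mismatch, gapopen,
# qstart, qend, sstart, send, evalue, bitscore
example = (
    "P30617\tProtA\t41.2\t110\t60\t2\t1\t108\t30\t139\t1e-20\t85.0\n"
    "P30617\tProtB\t55.0\t105\t45\t1\t3\t107\t28\t132\t3e-28\t110\n"
    "P02877\tProtC\t62.5\t40\t15\t0\t1\t40\t22\t61\t2e-12\t60.1\n"
)
second_round_queries = top_identity_hits(example)
print(second_round_queries)
```

Percent identity over a very short alignment can be misleading, so it may be worth checking alignment length alongside identity before a hit is reused as a query.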

Step 3. All candidate sequences identified across both rounds are retrieved using the Phytozome-BioMart tool. Structural annotation data available for download include gene name, transcript name, chromosome and transcript coordinates, and gene description.

Step 4. Retrieved sequences are submitted to the InterPro web service for scanning against the InterPro protein signature databases, including Pfam, to predict domains and important sites.

Note

Proteins with at least one lectin domain can be considered putative lectins for downstream analysis. Many identified lectin genes encode chimeric proteins in which the lectin domain is fused to additional functional domains, such as protein kinase and F-box domains.
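The rule in this note can be sketched as a small classification step. The example assumes the domain predictions have already been reduced to a mapping from protein ID to Pfam accessions; the protein names are hypothetical, and PF00069 (Pkinase) stands in for an additional, non-lectin domain.

```python
# Pfam accessions from Table 1 (CRA and EUL have none and are handled by BLASTp).
LECTIN_PFAM = {
    "PF07367": "ABA", "PF07468": "Amaranthin", "PF08881": "Cyanovirin",
    "PF01453": "GNA", "PF00187": "Hevein", "PF01419": "JRL",
    "PF00139": "Legume lectin", "PF01476": "LysM", "PF14299": "Nictaba",
    "PF00652": "Ricin B",
}


def classify_proteins(domains_by_protein):
    """Keep proteins with at least one lectin domain and flag chimeric ones."""
    result = {}
    for protein, pfam_ids in domains_by_protein.items():
        families = sorted({LECTIN_PFAM[p] for p in pfam_ids if p in LECTIN_PFAM})
        if not families:
            continue  # no lectin domain: not a putative lectin
        other = sorted({p for p in pfam_ids if p not in LECTIN_PFAM})
        result[protein] = {
            "families": families,
            "other_domains": other,
            "chimeric": bool(other),
        }
    return result


# Hypothetical input: Prot1 carries a GNA domain plus a protein kinase domain.
predictions = {
    "Prot1": ["PF01453", "PF00069"],
    "Prot2": ["PF00139"],
    "Prot3": ["PF00069"],
}
putative_lectins = classify_proteins(predictions)
print(putative_lectins)
```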

Local InterProScan Workflow

When InterProScan is installed on a local server, the complete predicted proteome can be scanned in a single run and results filtered against the Pfam IDs in Table 1.

Step 1. Run InterProScan against the complete predicted proteome of the target species.

Note

For Pfam, InterProScan applies the curated gathering thresholds by default, so only domain matches that score above each family's cutoff are reported.
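A local run might be prepared as in the sketch below. The file names are hypothetical, and restricting the run to Pfam with `-appl` is an optional assumption made here to shorten the run, not a requirement of this protocol. The helper that strips `*` is included because predicted proteomes often carry stop symbols that InterProScan commonly rejects.

```python
def build_interproscan_command(proteome_fasta, out_tsv, applications=None):
    """Return an interproscan.sh argument list that writes TSV output."""
    cmd = ["interproscan.sh", "-i", proteome_fasta, "-f", "tsv", "-o", out_tsv]
    if applications:
        # Optional: limit the run to selected member databases, e.g. Pfam.
        cmd += ["-appl", ",".join(applications)]
    return cmd


def strip_stop_symbols(fasta_text):
    """Remove '*' from sequence lines while leaving header lines untouched."""
    cleaned = []
    for line in fasta_text.splitlines():
        cleaned.append(line if line.startswith(">") else line.replace("*", ""))
    return "\n".join(cleaned) + "\n"


full_run = build_interproscan_command("proteome.clean.fasta", "proteome.ipr.tsv")
pfam_only = build_interproscan_command(
    "proteome.clean.fasta", "proteome.pfam.tsv", applications=["Pfam"]
)
print(" ".join(pfam_only))
```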

Step 2. Filter the output by the Pfam IDs listed in Table 1 to extract all sequences with a lectin domain.
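The filtering step can be sketched as follows, assuming the standard InterProScan TSV layout in which column 1 holds the protein accession, column 4 the analysis name, and column 5 the signature accession. The rows in the example are invented toy data.

```python
# Pfam accessions from Table 1, mapped to their lectin families.
LECTIN_PFAM = {
    "PF07367": "ABA", "PF07468": "Amaranthin", "PF08881": "Cyanovirin",
    "PF01453": "GNA", "PF00187": "Hevein", "PF01419": "JRL",
    "PF00139": "Legume lectin", "PF01476": "LysM", "PF14299": "Nictaba",
    "PF00652": "Ricin B",
}


def lectin_hits(interproscan_tsv_text):
    """Return {protein_id: set of lectin families} from InterProScan TSV text."""
    hits = {}
    for line in interproscan_tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 5:
            continue
        protein, analysis, signature = fields[0], fields[3], fields[4]
        if analysis == "Pfam" and signature in LECTIN_PFAM:
            hits.setdefault(protein, set()).add(LECTIN_PFAM[signature])
    return hits


# Toy rows: accession, MD5, length, analysis, signature accession,
# description, start, stop, score, status, date
example = (
    "Prot1\tmd5a\t450\tPfam\tPF01453\tB_lectin\t30\t140\t1.2E-25\tT\t01-05-2026\n"
    "Prot1\tmd5a\t450\tPfam\tPF00069\tPkinase\t200\t430\t3.0E-40\tT\t01-05-2026\n"
    "Prot2\tmd5b\t280\tPfam\tPF00139\tLectin_legB\t25\t260\t5.0E-60\tT\t01-05-2026\n"
    "Prot3\tmd5c\t310\tPfam\tPF00069\tPkinase\t10\t290\t1.0E-50\tT\t01-05-2026\n"
)
pfam_lectins = lectin_hits(example)
print(pfam_lectins)
```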

Step 3. For CRA and EUL, which have no Pfam domain, a separate BLASTp search using their model sequences (Table 1) is still required.
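One way to fold the CRA and EUL results into the Pfam-based candidate list is sketched below. It assumes BLAST tabular output (`-outfmt 6`, E-value in column 11); the E-value cutoff of 1e-5, the hit rows, and the protein names are illustrative assumptions, and hits found this way still need manual review.

```python
def blast_family_hits(blast_tsv_text, family_by_query, max_evalue):
    """Return {subject_id: set of families} for hits at or below max_evalue."""
    hits = {}
    for line in blast_tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 12 or fields[0] not in family_by_query:
            continue
        if float(fields[10]) <= max_evalue:
            hits.setdefault(fields[1], set()).add(family_by_query[fields[0]])
    return hits


def merge_candidates(*sources):
    """Union several {protein_id: set of families} mappings."""
    merged = {}
    for source in sources:
        for protein, families in source.items():
            merged.setdefault(protein, set()).update(families)
    return merged


# Model sequence accessions from Table 1.
FAMILY_BY_QUERY = {"ABL98074.1": "CRA", "ABW73993.1": "EUL"}

# Toy BLASTp rows; the second hit fails the hypothetical cutoff.
blast_rows = (
    "ABL98074.1\tProt7\t48.0\t300\t140\t5\t1\t298\t20\t320\t2e-30\t120\n"
    "ABW73993.1\tProt9\t25.0\t60\t40\t2\t5\t64\t10\t70\t0.5\t28.1\n"
)
pfam_hits = {"Prot1": {"GNA"}}  # e.g. the result of the Pfam filtering step
candidates = merge_candidates(
    pfam_hits, blast_family_hits(blast_rows, FAMILY_BY_QUERY, max_evalue=1e-5)
)
print(candidates)
```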

Key Insights

Not all 12 lectin families are detected in every plant genome. The ABA, Amaranthin, and Cyanovirin families were not detected across multiple species, including Arabidopsis, Phaseolus species, and cucumber.
The legume lectin and GNA families are usually the most abundant in plant genomes, though their relative numbers vary by species.
Most identified lectin genes encode chimeric proteins fused with additional functional domains.

Key Limitations

Genome assembly and annotation quality directly affects the number of sequences retrieved. Fragmented assemblies and incomplete annotations can lead to underestimation of lectin gene numbers.
No single database covers all 12 families equally; Pfam annotation is incomplete for several non-model lectin families.

Published Studies

Phaseolus species⁵, Arabidopsis thaliana⁶, cucumber (Cucumis sativus)⁷, rice (Oryza sativa)⁸, soybean (Glycine max)⁹, and sorghum (Sorghum bicolor)¹⁰.

References

1. Van Damme, E. J. M., Lannoo, N. & Peumans, W. J. Plant lectins. in Advances in Botanical Research vol. 48 107–209 (Academic Press, 2008).

2. Blum, M. et al. InterPro: The protein sequence classification resource in 2025. Nucleic Acids Research 53, D444–D456 (2024).

3. Chao, J. et al. MG2C: A user-friendly online tool for drawing genetic maps. Molecular Horticulture 1, (2021).

4. Goodstein, D. M. et al. Phytozome: A comparative platform for green plant genomics. Nucleic Acids Research 40, D1178–D1186 (2012).

5. Osman, M. E. M. et al. Lectin gene families in three Phaseolus species: Genome-wide identification, evolutionary analysis, pleiotropic effect, and regulation under multiple stress conditions. Journal of Molecular Evolution 94, 28–51 (2025).

6. Eggermont, L., Verstraeten, B. & Van Damme, E. J. M. Genome-wide screening for lectin motifs in Arabidopsis thaliana. The Plant Genome 10, plantgenome2017.02.0010 (2017).

7. Dang, L. & Van Damme, E. J. M. Genome-wide identification and domain organization of lectin domains in cucumber. Plant Physiology and Biochemistry 108, 165–176 (2016).

8. Tsaneva, M., De Schutter, K., Verstraeten, B. & Van Damme, E. J. M. Lectin sequence distribution in QTLs from rice (Oryza sativa) suggest a role in morphological traits and stress responses. International Journal of Molecular Sciences 20, 437 (2019).

9. Van Holle, S. & Van Damme, E. Distribution and evolution of the lectin family in soybean (Glycine max). Molecules 20, 2868–2891 (2015).

10. Osman, M. E. M., Dirar, A. I. & Konozy, E. H. E. Genome-wide screening of lectin putative genes from Sorghum bicolor L., distribution in QTLs and a probable implications of lectins in abiotic stress tolerance. BMC Plant Biology 22, (2022).