Sequence Identification
plant lectins, genome-wide identification, BLASTp, HMMER, InterProScan, Phytozome, lectin families, Pfam domains
Objective of this section
Identify all putative lectin-encoding genes in a plant genome by searching for sequence similarity to reference model sequences and annotated Pfam lectin domains. The accuracy of this step is critical because all subsequent analyses build on it.
Reference model sequences and Pfam domains
The 12 plant lectin families recognized in the literature1, their reference model sequences, and the Pfam domains used for identification are listed in Table 1 below.
| Lectin Family | Model Organism | Lectin Domain | Pfam ID | Accession |
|---|---|---|---|---|
| ABA | Agaricus bisporus | FB_lectin | PF07367 | Q00022 |
| Amaranthin | Amaranthus caudatus | Agglutinin | PF07468 | AAL05954.1 |
| CRA | Robinia pseudoacacia | — | — | ABL98074.1 |
| Cyanovirin | Nostoc ellipsosporum | CVNH | PF08881 | P81180 |
| EUL | Euonymus europaeus | — | — | ABW73993.1 |
| GNA | Galanthus nivalis | B_lectin | PF01453 | P30617 |
| Hevein | Hevea brasiliensis | Chitin_bind_1 | PF00187 | P02877 |
| JRL | Artocarpus integer | Jacalin | PF01419 | AAA32680.1 |
| Legume lectin | Glycine max | Lectin_legB | PF00139 | P05046 |
| LysM | Brassica juncea | LysM | PF01476 | BAN83772.1 |
| Nictaba | Nicotiana tabacum | PP2 | PF14299 | AAK84134.1 |
| Ricin B | Ricinus communis | Ricin_B_lectin | PF00652 | 2AAI_B |
Tools and databases
| Tool / Database | Tested Version | Platform | Purpose |
|---|---|---|---|
| BLAST+ (blastp) | 2.15.0 | Linux/Windows/macOS | Homology-based lectin identification |
| HMMER (hmmsearch) | 3.4 | Linux | Profile-based Pfam domain searching |
| InterProScan | 5.77-108.0 | Linux | Predicting domains and important sites |
| NCBI | — | Web | Genome assembly and annotation download |
| Ensembl Plants | — | Web | Genome assembly and annotation download |
| Phytozome | 14 | Web | Public repository of plant genomic resources |
| InterPro and/or Pfam2 | — | Web | Lectin domain HMM profiles |
| MapChart | 2.32 | Windows | Graphical presentation of linkage maps and QTLs |
| MG2C3 | 2.1 | Web | Online chromosomal map construction and visualization |
Phytozome-based Identification Workflow
Genome assemblies and predicted proteomes of available plant species are accessed through Phytozome4, a public repository of plant genomic resources.
Some plant species have standalone databases for their genome assemblies and annotations. Most of these also offer a built-in BLASTp search comparable to Phytozome-BLASTp.
Step 1. Each of the 12 model sequences from Table 1 is submitted as a separate BLASTp query against the target species’ predicted proteome using Phytozome-BLASTp.
A permissive E-value threshold is recommended because the model sequences come from distantly related organisms and family members in the target genome may be highly divergent.
BLOSUM62 is the preferred comparison matrix for this search. Default word length is appropriate and does not need adjustment.
Step 2. Sequences with the highest identity percentage from Step 1 are selected and used as queries in a second BLASTp search against the same assembly. This improves detection of divergent family members that the cross-species model sequences may miss.
Step 3. All candidate sequences identified across both rounds are retrieved using the Phytozome-BioMart tool. Structural annotation data available for download include gene name, transcript name, chromosome and transcript coordinates, and gene description.
Step 4. Retrieved sequences are submitted to the InterPro web service for scanning against the InterPro protein signature databases, including Pfam, to predict domains and important sites.
Proteins with at least one lectin domain can be considered putative lectins for downstream analysis. Many identified lectin genes encode chimeric proteins in which the lectin domain is fused with additional functional domains, such as protein kinase and F-box domains.
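The two BLASTp rounds in Steps 1 and 2 can also be reproduced offline. The following is a minimal sketch, assuming a local BLAST+ installation and a protein database built from the target proteome rather than the Phytozome web form. The function names, the gene IDs in the example, and the choice of one top-identity seed per query are illustrative assumptions. The `evalue` argument is a placeholder to be set by the user, not a recommended threshold.

```python
# Standard column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def blastp_command(query_fasta, db, out_tsv, evalue=10.0):
    """Build a local blastp call (assumed stand-in for Phytozome-BLASTp).

    The evalue default is only a placeholder; choose a threshold
    suited to the target genome.
    """
    return ["blastp", "-query", query_fasta, "-db", db,
            "-matrix", "BLOSUM62", "-evalue", str(evalue),
            "-outfmt", "6", "-out", out_tsv]


def parse_hits(tsv_text):
    """Parse -outfmt 6 text into dicts, converting identity and E-value to floats."""
    hits = []
    for line in tsv_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        hit = dict(zip(FIELDS, line.split("\t")))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hits.append(hit)
    return hits


def best_identity_seeds(hits):
    """For each query, return the subject ID with the highest percent identity."""
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or hit["pident"] > current["pident"]:
            best[hit["qseqid"]] = hit
    return {query: hit["sseqid"] for query, hit in best.items()}


def candidate_ids(*rounds):
    """Union of subject IDs across any number of BLASTp rounds."""
    return {hit["sseqid"] for hits in rounds for hit in hits}


# Hypothetical first-round output: two model queries against a target proteome.
ROUND1 = (
    "P05046\tGeneA.1\t62.5\t240\t80\t3\t30\t270\t28\t265\t1e-90\t280\n"
    "P05046\tGeneB.1\t35.0\t200\t120\t5\t40\t240\t35\t230\t2e-20\t95.1\n"
    "P30617\tGeneC.1\t48.2\t110\t55\t1\t25\t134\t30\t139\t3e-25\t101\n"
)
round1_hits = parse_hits(ROUND1)
seeds = best_identity_seeds(round1_hits)  # queries for the second round
```

The sequences named in `seeds` would then be extracted from the proteome and searched again with the same command, and `candidate_ids(round1_hits, round2_hits)` gives the combined list to retrieve in Step 3.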
Local InterProScan Workflow
When InterProScan is installed on a local server, the complete predicted proteome can be scanned in a single run and results filtered against the Pfam IDs in Table 1.
Step 1. Run InterProScan against the complete predicted proteome of the target species.
For Pfam, InterProScan applies the curated gathering thresholds by default, so only statistically significant domain matches are reported.
Step 2. Filter the output by the Pfam IDs listed in Table 1 to extract all sequences with a lectin domain.
Step 3. For CRA and EUL, which have no Pfam domain, a separate BLASTp search using their model sequences (Table 1) is still required.
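Steps 1 and 2 of the local workflow can be sketched as follows. The sketch assumes InterProScan's tab-separated output, in which column 1 holds the protein accession, column 4 the analysis name, and column 5 the signature accession. Restricting the run to Pfam with `-appl Pfam` is an assumption made here to keep the output small, and the function names and example rows are hypothetical. CRA and EUL are absent from the lookup because they have no Pfam ID and still need the BLASTp search in Step 3.

```python
# Pfam accessions from Table 1, keyed to their lectin family.
LECTIN_PFAM = {
    "PF07367": "ABA", "PF07468": "Amaranthin", "PF08881": "Cyanovirin",
    "PF01453": "GNA", "PF00187": "Hevein", "PF01419": "JRL",
    "PF00139": "Legume lectin", "PF01476": "LysM", "PF14299": "Nictaba",
    "PF00652": "Ricin B",
}


def interproscan_command(proteome_fasta, out_tsv):
    """Build an InterProScan call limited to Pfam, writing TSV output."""
    return ["interproscan.sh", "-i", proteome_fasta,
            "-f", "tsv", "-appl", "Pfam", "-o", out_tsv]


def lectin_families(tsv_text):
    """Map each protein ID to the set of lectin families among its Pfam hits."""
    found = {}
    for line in tsv_text.splitlines():
        cols = line.split("\t")
        # Skip short lines and matches from analyses other than Pfam.
        if len(cols) < 5 or cols[3] != "Pfam":
            continue
        family = LECTIN_PFAM.get(cols[4])
        if family is not None:
            found.setdefault(cols[0], set()).add(family)
    return found


# Hypothetical InterProScan rows (IDs, checksums, and scores are made up).
EXAMPLE_TSV = "\n".join([
    "GeneA.1\tmd5a\t650\tPfam\tPF00139\tLegume lectin domain\t30\t260\t1.2E-60\tT\t01-01-2025\t-\t-",
    "GeneA.1\tmd5a\t650\tPfam\tPF00069\tProtein kinase domain\t340\t610\t3.0E-45\tT\t01-01-2025\t-\t-",
    "GeneC.1\tmd5c\t160\tPfam\tPF01453\tD-mannose binding lectin\t25\t134\t5.1E-30\tT\t01-01-2025\t-\t-",
    "GeneE.1\tmd5e\t400\tPfam\tPF00646\tF-box domain\t10\t55\t2.2E-10\tT\t01-01-2025\t-\t-",
])
putative_lectins = lectin_families(EXAMPLE_TSV)
```

In this example `GeneE.1` is dropped because its only Pfam hit is not a lectin domain, while `GeneA.1` is kept even though it also carries a kinase domain.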
Key Insights
- Not all 12 lectin families are detected in every plant genome. The ABA, Amaranthin, and Cyanovirin families were not detected in several species, including Arabidopsis, Phaseolus species, and cucumber.
- The Legume lectin and GNA families are usually the most abundant in plant genomes, though their relative numbers vary by species.
- Most identified lectin genes encode chimeric proteins fused with additional functional domains.
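These observations can be checked for a new genome with a short tally once each protein's Pfam hits are known. The sketch below is illustrative only: the function name and gene IDs are hypothetical, the lookup is a three-family subset of Table 1 chosen for brevity, and "chimeric" is taken here to mean a lectin-domain protein with at least one additional Pfam domain outside the lookup.

```python
from collections import Counter


def summarize(domains_by_protein, lectin_pfam):
    """Count proteins per lectin family, list undetected families, flag chimeras.

    domains_by_protein: protein ID -> set of Pfam accessions found on it.
    lectin_pfam: Pfam accession -> lectin family name.
    """
    counts = Counter()
    chimeric = []
    for protein, pfams in sorted(domains_by_protein.items()):
        families = {lectin_pfam[p] for p in pfams if p in lectin_pfam}
        if not families:
            continue  # not a putative lectin
        counts.update(families)
        # Assumed definition: any extra non-lectin Pfam domain makes it chimeric.
        if pfams - set(lectin_pfam):
            chimeric.append(protein)
    missing = sorted(set(lectin_pfam.values()) - set(counts))
    return counts, missing, chimeric


# Hypothetical input: a subset of Table 1 and made-up gene IDs.
lectin_subset = {"PF00139": "Legume lectin", "PF01453": "GNA", "PF07468": "Amaranthin"}
domains = {
    "GeneA.1": {"PF00139", "PF00069"},  # lectin domain plus protein kinase
    "GeneB.1": {"PF00139"},
    "GeneC.1": {"PF01453"},
    "GeneE.1": {"PF00646"},             # F-box only, not a lectin
}
counts, missing, chimeric = summarize(domains, lectin_subset)
```

Here `missing` lists the families with no hit in the genome, which is the expected outcome for some families rather than a sign of a failed search.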
Key Limitations
- Genome assembly and annotation quality directly affects the number of sequences retrieved. Fragmented assemblies and incomplete annotations can lead to underestimation of lectin gene numbers.
- No single database covers all 12 families equally; Pfam annotation is incomplete for several non-model lectin families.
Published Studies
Phaseolus species5, Arabidopsis thaliana6, cucumber (Cucumis sativus)7, rice (Oryza sativa)8, soybean (Glycine max)9, and sorghum (Sorghum bicolor)10