GLAST: A Novel Molecular Search & Sequence Alignment Engine

Combining Full-Text Indexing with 3-Letter k-mer Alignment

Overview

GLAST (GenesisL1 Local Alignment & Search Tool) is a single-file Python application that combines full-text search (via Whoosh) with local sequence alignment (via Parasail), so developers can quickly deploy search over molecular data, such as protein or nucleotide sequences, together with its metadata. The whole system ships as one Python script, which keeps it lightweight while still being useful for scientific data exploration. It can search and align quickly across a dataset of millions of sequences and their plain-text metadata.

Core Innovations

Scientific Deep-Dive

Traditional sequence search tools such as BLAST seed their searches with hash tables of short k-mer matches. GLAST instead uses a standard text search engine (Whoosh), the kind normally applied to web or textual data, to build an index of 3-letter “words” taken from sequences. This may look unusual, but in practice it works as a powerful filter stage. Each sequence in the database is broken into overlapping 3-letter windows, e.g. QEE -> EEY -> EYA -> YAK -> ...

The novelty lies in bridging the gap between text-based “fuzzy” indexing of sequences and standard metadata-based fields (like IDCODE, COMPOUND, HEADER), all in a single library. Once a subset of sequences is found via Whoosh queries, GLAST hands them off to Parasail’s Smith-Waterman local alignment routine, computing alignment scores (and optional alignment strings) to rank the final results.

How 3-Letter k-mers & Whoosh Work

Indexing a sequence of length N with overlapping 3-letter windows produces N - 2 tokens (for N >= 3), so each sequence is treated as a text made of overlapping “words.” For example:

Sequence: QEEYAK...

3-letter k-mers:
QEE
EEY
EYA
YAK
...
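
The windowing shown above can be sketched in a few lines of Python. This is a minimal illustration, not GLAST's actual tokenizer; the function name kmers and the upper-casing step are assumptions made for the example.

```python
def kmers(sequence, k=3):
    """Return every overlapping window of length k, left to right.

    Hypothetical helper for illustration; GLAST's own tokenizer may differ.
    """
    sequence = sequence.strip().upper()
    # A sequence of length N yields N - k + 1 windows (none if N < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("QEEYAK"))  # ['QEE', 'EEY', 'EYA', 'YAK']
```

Sequences shorter than k produce no tokens, so very short queries cannot be matched through the k-mer field alone.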
  

These tokens are stored in Whoosh's index, so they can be queried directly (e.g. searching for “EEY” or partial matches). Because Whoosh natively supports text fields, the same index can also hold other fields such as HEADER and SOURCE. A single query can then combine a standard textual clause (title: “helicase”) with sequence windows (sequence: “QEE”).
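
To make the filter idea concrete, here is a toy inverted index written with plain dictionaries. It is a stand-in for Whoosh, not Whoosh's API, and everything in it (the TinyIndex class, the "header" and "sequence" keys, the two sample records) is invented for illustration. It only shows how a metadata term and a set of 3-letter windows can be intersected to get a candidate list.

```python
from collections import defaultdict


def kmers(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class TinyIndex:
    """Toy inverted index: (field, token) -> set of record ids.

    A dict-based stand-in for a real search engine such as Whoosh.
    """

    def __init__(self):
        self.postings = defaultdict(set)
        self.records = {}

    def add(self, record):
        rid = record["IDCODE"]
        self.records[rid] = record
        for token in kmers(record["SEQUENCE"].upper()):
            self.postings[("sequence", token)].add(rid)
        for word in record["HEADER"].upper().split():
            self.postings[("header", word)].add(rid)

    def search(self, header_word=None, sequence=None):
        """Ids matching the header word AND every 3-mer of `sequence`."""
        wanted = []
        if header_word:
            wanted.append(("header", header_word.upper()))
        if sequence:
            wanted.extend(("sequence", t) for t in kmers(sequence.upper()))
        if not wanted:
            # No usable constraint (e.g. empty query): return everything.
            return set(self.records)
        hits = [self.postings.get(key, set()) for key in wanted]
        return set.intersection(*hits)


index = TinyIndex()
index.add({"IDCODE": "101D", "HEADER": "MOTOR PROTEIN", "SEQUENCE": "QEEYAK"})
index.add({"IDCODE": "102D", "HEADER": "DNA HELICASE", "SEQUENCE": "MKQEEY"})

print(sorted(index.search(sequence="QEEY")))                          # ['101D', '102D']
print(sorted(index.search(header_word="helicase", sequence="QEEY")))  # ['102D']
```

Requiring every window to match is the strictest possible filter. A real engine can also rank by how many windows match, which tolerates mismatches at the cost of a larger candidate set.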

Parasail Local Alignment

After the initial filtering via the index, GLAST calls Parasail to compute a Smith-Waterman local alignment score for each candidate sequence. A local alignment finds the highest-scoring region shared by the query and the subject instead of aligning both end to end, which suits protein sequences and partial matches. Combining the two stages keeps overhead low: the costly alignment runs only on the subset of sequences that is likely to match in the first place.
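
The scoring step can be illustrated with a small pure-Python Smith-Waterman. This is a teaching sketch, not Parasail and not GLAST's configuration: it uses a flat match/mismatch score and a linear gap penalty (all three values are assumptions), whereas a production aligner would typically use a substitution matrix and vectorized code. The rank_candidates helper is likewise hypothetical.

```python
def smith_waterman_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty.

    Keeps only two rows of the dynamic-programming table, since the
    score alone does not need a traceback.
    """
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def rank_candidates(query, candidates):
    """Score only the pre-filtered candidates and sort best-first."""
    scored = [(smith_waterman_score(query, seq), seq_id)
              for seq_id, seq in candidates.items()]
    return sorted(scored, reverse=True)


print(smith_waterman_score("QEEY", "QEEY"))      # 8
print(smith_waterman_score("QEEY", "EEEYGGSS"))  # 6
print(rank_candidates("QEEY", {"a": "XXXX", "b": "MKQEEY"}))  # [(8, 'b'), (0, 'a')]
```

The cost of each alignment grows with the product of the two sequence lengths, which is why shrinking the candidate set before this step matters.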

Single-File Python Script

All of these components (Whoosh indexing, Parasail alignment, and the Flask REST server) are integrated into a single `.py` file, which greatly simplifies deployment. A lab can simply run:

python glast.py
  

...and immediately have a local or network-accessible service for searching a JSON dataset with both free-text and sequence queries.


GLAST API Methods & Examples

Below is a reference for the GLAST REST API endpoints. These endpoints are typically served on http://hostname:8899 or behind a proxy such as http://api.molnft.org/ depending on your setup.

GET /api/nfts

Description: Returns a quick paginated list of NFTs from the Whoosh index (no alignment, no session).

Query Parameters:

Param | Description                                                       | Default
query | Basic search term (applied across IDCODE, HEADER, COMPOUND, etc.) | (empty - returns all docs)
page  | 1-based page number                                               | 1
limit | Page size (max results per page)                                  | 20

Example:

GET http://api.molnft.org/api/nfts?query=DNA&page=1&limit=20

Sample Response:

{
  "status": "success",
  "page": 1,
  "limit": 20,
  "totalItems": 86,
  "items": [
    {
      "NFTID": "42",
      "IDCODE": "101D",
      "HEADER": "MY HEADER",
      "ACCESSION_DATE": "2023-01-01",
      "COMPOUND": "HELICASE",
      "SOURCE": "HUMAN",
      "AUTHOR_LIST": "Smith, Jones",
      "RESOLUTION": "2.0A",
      "EXPERIMENT_TYPE": "X-RAY",
      "SEQUENCE": "QEEY..."
    },
    ...
  ]
}
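
A client call for this endpoint might look like the sketch below, using only the Python standard library. The function names nfts_url and list_nfts are hypothetical, and the base URL is the proxy address mentioned above; substitute http://hostname:8899 for a local deployment.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://api.molnft.org"  # assumption: adjust to your deployment


def nfts_url(query="", page=1, limit=20, base_url=BASE_URL):
    """Build the GET /api/nfts URL; an empty query lists all documents."""
    params = {}
    if query:
        params["query"] = query
    params["page"] = page
    params["limit"] = limit
    return f"{base_url}/api/nfts?{urlencode(params)}"


def list_nfts(query="", page=1, limit=20, base_url=BASE_URL):
    """Fetch one page and return the decoded JSON body."""
    with urlopen(nfts_url(query, page, limit, base_url), timeout=30) as resp:
        return json.load(resp)


print(nfts_url("DNA"))  # http://api.molnft.org/api/nfts?query=DNA&page=1&limit=20
```

Calling list_nfts("DNA") would then return a dictionary shaped like the sample response, so data["items"] can be iterated for fields such as IDCODE and COMPOUND.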

POST /api/align

Description: Performs one or more local sequence alignments via Parasail: a query against a single subject, or against a list of subjects.

Request Body (JSON), single subject:

{
  "query": "QEEY",
  "subject": "XXXXXX"
}

Request Body (JSON), multiple subjects:

{
  "query": "QEEY",
  "subjects": ["XXXXXX", "YYYYYY"]
}


Example 1: Single query/subject

curl -X POST -H "Content-Type: application/json" \
  -d '{"query":"QEEY","subject":"EEEYGGSS"}' \
  http://api.molnft.org/api/align

Sample Single-Subject Response (illustrative values):

{
  "status": "success",
  "alignment": {
    "score": 52,
    "aligned_query": "QEEY--",
    "aligned_ref":   "QEEEYY",
    "middle":        "|||--"
  }
}

Example 2: Single query vs multiple subjects

curl -X POST -H "Content-Type: application/json" \
  -d '{"query":"QEEY","subjects":["XXXXXX","YYYYYY"]}' \
  http://api.molnft.org/api/align

Sample Multi-Subject Response (illustrative values):

{
  "status": "success",
  "alignments": [
    {
      "score": 47,
      "aligned_query": "QEEY",
      "aligned_ref":   "XXXXXX",
      "middle":        " |  "
    },
    {
      "score": 10,
      "aligned_query": "QEEY---",
      "aligned_ref":   "YYYYYY",
      "middle":        "|| "
    }
  ]
}
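
The same two request shapes can be sent from Python. The sketch below uses only the standard library; align_request and align are hypothetical names, and the rule "a string means one subject, a list means several" is a convenience of this example, not something the API defines.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://api.molnft.org"  # assumption: adjust to your deployment


def align_request(query, subjects, base_url=BASE_URL):
    """Build the POST /api/align request without sending it."""
    if isinstance(subjects, str):
        body = {"query": query, "subject": subjects}
    else:
        body = {"query": query, "subjects": list(subjects)}
    return Request(
        f"{base_url}/api/align",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def align(query, subjects, base_url=BASE_URL):
    """Send the request and always return a list of alignment dicts."""
    with urlopen(align_request(query, subjects, base_url), timeout=30) as resp:
        payload = json.load(resp)
    # Single-subject replies carry "alignment"; multi-subject ones "alignments".
    if "alignment" in payload:
        return [payload["alignment"]]
    return payload.get("alignments", [])
```

Normalizing both reply shapes to a list lets calling code loop over results without checking which form of the request was sent.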

GET /api/search_and_align

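Description: A session-based search that can combine a basic text query (basic) with a sequence to align (align). When both are given, the result is the intersection of the two, with an alignment score for each hit. Results are stored on the server under a session ID (a UUID, returned as session) so that later pages can be fetched without repeating the search.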

Query Parameters (New Search):

Param | Description
basic | Text query across IDCODE, HEADER, COMPOUND, etc.
align | A sequence to align. If provided, results are sorted by alignmentScore.
page  | Page number
limit | Page size

Query Parameters (Existing Session):

Param   | Description
session | UUID returned from the first call
page    | Page number to fetch from the existing doc set
limit   | Page size

Example 1: New search + align:

GET /api/search_and_align?basic=Motor&align=QEEY&page=1&limit=20

Sample Response:

{
  "status": "success",
  "session": "5bfc9368-d348-4f24-a2b6-46365c6db3bc",
  "page": 1,
  "limit": 20,
  "totalItems": 12,
  "items": [
    {
      "NFTID": "42",
      "IDCODE": "101D",
      "HEADER": "MOTOR PROTEIN",
      "alignmentScore": 53,
      "aligned_query": "QEEY---",
      "aligned_ref":   "QEEYXXX",
      "aligned_middle": "|||   "
      ...
    },
    ...
  ]
}

Subsequent calls can reference the session param to fetch page 2, etc. For example:

GET /api/search_and_align?session=5bfc9368-d348-4f24-a2b6-46365c6db3bc&page=2&limit=20
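
The two-phase flow (new search first, then paging by session) can be wrapped in a generator. This is a sketch under a few assumptions: the names iter_search_and_align and fetch_json are hypothetical, and the loop stops when a page comes back empty or when totalItems items have been seen, which is this example's choice and not documented API behaviour.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url):
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_search_and_align(basic=None, align=None, limit=20,
                          base_url="http://api.molnft.org", fetch=fetch_json):
    """Yield every item: start a new search, then page by session id."""
    params = {k: v for k, v in (("basic", basic), ("align", align)) if v}
    page, seen = 1, 0
    while True:
        params.update(page=page, limit=limit)
        data = fetch(f"{base_url}/api/search_and_align?{urlencode(params)}")
        if data.get("status") != "success":
            raise RuntimeError(data.get("message", "unknown error"))
        items = data.get("items", [])
        yield from items
        seen += len(items)
        if not items or seen >= data.get("totalItems", 0):
            return
        # Later pages reference the stored result set via the session id.
        params = {"session": data["session"]}
        page += 1
```

The fetch parameter exists so the paging logic can be exercised with a stub instead of a live server. In normal use, iterating over iter_search_and_align(basic="Motor", align="QEEY") yields items already sorted by alignmentScore.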

GET /api/nft_by_idcode

Description: Returns a single numeric NFTID for the given IDCODE, if it exists in the index.

Query Param: ?code=IDCODE

Example:

GET http://api.molnft.org/api/nft_by_idcode?code=101D

Sample Success Response:

{
  "status": "success",
  "NFTID": "42"
}

Possible Errors:


General Error Handling

In most cases, if an error occurs (e.g., invalid parameters, missing data, server error), the response is JSON in the form:

{
  "status": "error",
  "message": "Description of the error"
}
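
On the client side, this error envelope can be turned into an exception in one place. The sketch below is an assumption about how a caller might handle it; GlastError and unwrap are hypothetical names and are not part of GLAST.

```python
import json


class GlastError(Exception):
    """Raised when the API reports "status": "error" (client-side name)."""


def unwrap(raw):
    """Decode a response body and raise if it is an error envelope."""
    payload = json.loads(raw)
    if payload.get("status") == "error":
        raise GlastError(payload.get("message", "unknown error"))
    return payload


ok = unwrap('{"status": "success", "NFTID": "42"}')
print(ok["NFTID"])  # 42

try:
    unwrap('{"status": "error", "message": "Description of the error"}')
except GlastError as exc:
    print(f"request failed: {exc}")  # request failed: Description of the error
```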

Versioning & Contact

This API is under continuous development. For questions or support, please contact the MolNFT team.
Version: 1.0