Combining Full-Text Indexing with 3-Letter k-mer Alignment
GLAST (GenesisL1 Local Alignment & Search Tool) is a single-file Python application that encapsulates both full-text search (via Whoosh) and local sequence alignment (via Parasail), allowing developers to rapidly deploy advanced search capabilities for molecular data and metadata such as protein or nucleotide sequences. The entire system can be packaged in one Python script, providing a powerful yet lightweight solution for scientific data exploration. It is capable of performing advanced search and alignments quickly in a dataset of millions of sequences and corresponding plain text metadata.
Traditional sequence search tools (like BLAST) rely on indexing or seeding strategies that revolve around hash tables of short k-mer matches. By contrast, GLAST uses a standard text search engine (Whoosh), commonly used for web or textual data, to create an index of 3-letter “words” from sequences. This might look unusual, but in practice it becomes a powerful filter stage. Each sequence in the database is broken into overlapping 3-letter windows, e.g. `QEE -> EEY -> EYA -> YAK ...`
The novelty lies in bridging the gap between text-based “fuzzy” indexing of sequences and standard metadata-based fields (like `IDCODE`, `COMPOUND`, `HEADER`), all in a single library.
Once a subset of sequences is found via Whoosh queries, GLAST hands them off to Parasail’s Smith-Waterman local alignment
routine, computing alignment scores (and optional alignment strings) to rank the final results.
Indexing a sequence of length N with 3-letter windows treats it as a text of N - 2 overlapping “words.” For example:
Sequence: QEEYAK...
3-letter k-mers: QEE EEY EYA YAK ...
These tokens get stored in Whoosh's index, letting us query them (e.g. searching for “EEY” or partial matches).
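The windowing step can be sketched in a few lines of Python. This is an illustrative sketch, not GLAST's actual code: the function names `kmer_tokens` and `as_indexable_text` are hypothetical, and upper-casing the input is an assumption.

```python
def kmer_tokens(sequence: str, k: int = 3) -> list[str]:
    """Return every overlapping k-letter window of `sequence`, in order."""
    seq = sequence.strip().upper()
    # A sequence of length N yields N - k + 1 windows (none if N < k).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def as_indexable_text(sequence: str, k: int = 3) -> str:
    """Join the windows with spaces so a text engine sees them as words."""
    return " ".join(kmer_tokens(sequence, k))


print(kmer_tokens("QEEYAK"))        # ['QEE', 'EEY', 'EYA', 'YAK']
print(as_indexable_text("QEEYAK"))  # QEE EEY EYA YAK
```

Joining the windows with spaces is one simple way to hand them to a text engine as ordinary words; a custom tokenizer inside the engine would be an alternative.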
Because Whoosh natively supports text fields, we can also store other fields like `HEADER`, `SOURCE`, etc. Queries can combine these with standard textual queries (title: “helicase”), plus sequence windows (sequence: “QEE”).
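To show how a metadata term and a sequence window can be combined in one query, here is a toy inverted index in plain Python. It is a stand-in for the idea only and does not use Whoosh's API. The `ToyIndex` class, its methods, and the three sample records are made up for illustration.

```python
from collections import defaultdict


class ToyIndex:
    """Tiny inverted index: (field, token) -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, header, sequence, k=3):
        self.docs[doc_id] = {"HEADER": header, "SEQUENCE": sequence}
        # Metadata is tokenized on whitespace, like ordinary text.
        for word in header.upper().split():
            self.postings[("HEADER", word)].add(doc_id)
        # The sequence is tokenized into overlapping k-letter windows.
        for i in range(len(sequence) - k + 1):
            self.postings[("SEQUENCE", sequence[i:i + k].upper())].add(doc_id)

    def search(self, **terms):
        """AND together one term per field, e.g. HEADER='helicase'."""
        hits = [self.postings.get((field, value.upper()), set())
                for field, value in terms.items()]
        return sorted(set.intersection(*hits)) if hits else sorted(self.docs)


index = ToyIndex()
index.add("101D", "DNA helicase", "QEEYAKLL")    # made-up sample records
index.add("102D", "Motor protein", "QEEYGGSS")
index.add("103D", "RNA helicase", "MKVLAAGG")

print(index.search(SEQUENCE="QEE"))                     # ['101D', '102D']
print(index.search(HEADER="helicase", SEQUENCE="QEE"))  # ['101D']
```

Because both kinds of token live in the same posting structure, intersecting a text hit with a sequence hit is a plain set operation, which is what makes the text engine usable as a filter stage.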
After the initial filtering via the index, GLAST calls Parasail to compute a local alignment score (Smith-Waterman) for each candidate sequence. This alignment is purely local, meaning it tries to find the highest-scoring local region between the query and subject. It’s well-suited for protein or partial matches. The combination ensures minimal overhead: you only run costly alignment on a subset of sequences likely to match in the first place.
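The filter-then-align flow can be sketched without any dependencies. The scoring function below is a textbook Smith-Waterman with a linear gap penalty and simple match/mismatch scores, swapped in for Parasail so the example stays self-contained. A real run through Parasail would typically use a substitution matrix and gap open/extend penalties, so the scores here are not comparable to GLAST's. All function names and the sample database are hypothetical.

```python
def sw_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            # Local alignment: a cell never drops below zero.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def shares_kmer(query, subject, k=3):
    """Cheap filter: does the subject contain any k-letter window of the query?"""
    windows = {query[i:i + k] for i in range(len(query) - k + 1)}
    return any(subject[i:i + k] in windows for i in range(len(subject) - k + 1))


def filter_then_align(query, database):
    """Align only the candidates that pass the k-mer filter, best score first."""
    candidates = [(name, seq) for name, seq in database.items()
                  if shares_kmer(query, seq)]
    scored = [(sw_score(query, seq), name) for name, seq in candidates]
    return sorted(scored, reverse=True)


database = {"A": "GGQEEYAKGG", "B": "MKVLAAGG", "C": "TTEEYTT"}
print(filter_then_align("QEEY", database))  # [(8, 'A'), (6, 'C')]
```

Sequence `B` shares no 3-letter window with the query, so it never reaches the quadratic-time alignment step.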
All of these components—Whoosh indexing, Parasail alignment, Flask REST server—are integrated into a single `.py` file. This drastically simplifies deployment. A lab can simply run:
python glast.py
...and immediately have a local or network-accessible service for searching a JSON dataset with both free-text and sequence queries.
Below is a reference for the GLAST REST API endpoints. These endpoints are typically served on `http://hostname:8899` or behind a proxy such as `http://api.molnft.org/`, depending on your setup.
GET /api/nfts
Description: Returns a quick paginated list of NFTs from the Whoosh index (no alignment, no session).
Query Parameters:
Param | Description | Default |
---|---|---|
`query` | Basic search term (applied across IDCODE, HEADER, COMPOUND, etc.) | (empty - returns all docs) |
`page` | 1-based page number | 1 |
`limit` | Page size (max results per page) | 20 |
Example:
GET http://api.molnft.org/api/nfts?query=DNA&page=1&limit=20
Sample Response:
{ "status": "success", "page": 1, "limit": 20, "totalItems": 86, "items": [ { "NFTID": "42", "IDCODE": "101D", "HEADER": "MY HEADER", "ACCESSION_DATE": "2023-01-01", "COMPOUND": "HELICASE", "SOURCE": "HUMAN", "AUTHOR_LIST": "Smith, Jones", "RESOLUTION": "2.0A", "EXPERIMENT_TYPE": "X-RAY", "SEQUENCE": "QEEY..." }, ... ] }
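A small client for this endpoint might look like the following sketch, which uses only the standard library. The helper names `nfts_url` and `fetch_nfts` are hypothetical, and `BASE_URL` should be adjusted to your own deployment.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://api.molnft.org"  # or http://hostname:8899 for a local instance


def nfts_url(query="", page=1, limit=20, base_url=BASE_URL):
    """Build the listing URL; the defaults mirror the parameter table above."""
    params = {"page": page, "limit": limit}
    if query:
        params = {"query": query, **params}
    return f"{base_url}/api/nfts?{urlencode(params)}"


def fetch_nfts(query="", page=1, limit=20, timeout=10):
    """Fetch one page and return the decoded JSON body (needs network access)."""
    with urlopen(nfts_url(query, page, limit), timeout=timeout) as response:
        return json.load(response)


print(nfts_url("DNA"))
# http://api.molnft.org/api/nfts?query=DNA&page=1&limit=20
```

Keeping URL construction separate from the network call makes the request easy to inspect or log before it is sent.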
POST /api/align
Description: Performs one or more local sequence alignments via `parasail`.
Request Body (JSON):
{ "query": "QEEY", "subject": "XXXXXX" }
{ "query": "QEEY", "subjects": ["XXXXXX", "YYYYYY"] }
Example 1: Single query/subject
curl -X POST -H "Content-Type: application/json" \
-d '{"query":"QEEY","subject":"EEEYGGSS"}' \
http://api.molnft.org/api/align
Sample Single-Subject Response:
{ "status": "success", "alignment": { "score": 52, "aligned_query": "QEEY--", "aligned_ref": "QEEEYY", "middle": "|||--" } }
Example 2: Single query vs multiple subjects
curl -X POST -H "Content-Type: application/json" \
-d '{"query":"QEEY","subjects":["XXXXXX","YYYYYY"]}' \
http://api.molnft.org/api/align
Sample Multi-Subject Response:
{ "status": "success", "alignments": [ { "score": 47, "aligned_query": "QEEY", "aligned_ref": "XXXXXX", "middle": " | " }, { "score": 10, "aligned_query": "QEEY---", "aligned_ref": "YYYYYY", "middle": "|| " } ] }
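The two curl calls above can be reproduced from Python with the standard library. This is a sketch under the assumption that the service accepts exactly the two body shapes shown (`subject` for one sequence, `subjects` for a list). The helper names `align_request` and `align` are hypothetical.

```python
import json
from urllib.request import Request, urlopen

ALIGN_URL = "http://api.molnft.org/api/align"


def align_request(query, subjects, url=ALIGN_URL):
    """Build the POST request; a str means one subject, a list means many."""
    if isinstance(subjects, str):
        body = {"query": query, "subject": subjects}
    else:
        body = {"query": query, "subjects": list(subjects)}
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def align(query, subjects, timeout=10):
    """Send the request and return the decoded JSON (needs network access)."""
    with urlopen(align_request(query, subjects), timeout=timeout) as response:
        return json.load(response)


# Inspect the request without sending it.
request = align_request("QEEY", ["XXXXXX", "YYYYYY"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

A single-subject reply carries one `alignment` object while a multi-subject reply carries an `alignments` list, so callers need to read the key that matches the request they sent.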
GET /api/search_and_align
Description: A session-based search that can combine a `basic` text search and an `align` sequence, producing an intersection plus alignment scores. Results are stored in a server session keyed by `session` (UUID).
Query Parameters (New Search):
Param | Description |
---|---|
`basic` | Text query across IDCODE, HEADER, COMPOUND, etc. |
`align` | A sequence to align. If provided, results are sorted by alignmentScore. |
`page` | Page number |
`limit` | Page size |
Query Parameters (Existing Session):
Param | Description |
---|---|
`session` | UUID returned from the first call |
`page` | Page number to fetch from the existing doc set |
`limit` | Page size |
Example 1: New search + align:
GET /api/search_and_align?basic=Motor&align=QEEY&page=1&limit=20
Sample Response:
{ "status": "success", "session": "5bfc9368-d348-4f24-a2b6-46365c6db3bc", "page": 1, "limit": 20, "totalItems": 12, "items": [ { "NFTID": "42", "IDCODE": "101D", "HEADER": "MOTOR PROTEIN", "alignmentScore": 53, "aligned_query": "QEEY---", "aligned_ref": "QEEYXXX", "aligned_middle": "||| " ... }, ... ] }
Subsequent calls can reference the `session` param to fetch page 2, etc. For example:
GET /api/search_and_align?session=5bfc9368-d348-4f24-a2b6-46365c6db3bc&page=2&limit=20
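The two-phase pattern (new search first, then session-based paging) can be sketched as a generator. This assumes that session pages come back in the same shape as the first response (`items`, `totalItems`), which the reference above does not state explicitly. `search_url` and `iter_items` are hypothetical helpers, and `fetch` is any callable you supply that takes a URL and returns decoded JSON.

```python
from urllib.parse import urlencode


def search_url(basic=None, align=None, session=None, page=1, limit=20):
    """Build a /api/search_and_align URL for a new search or an existing session."""
    if session:
        params = {"session": session}
    else:
        params = {k: v for k, v in (("basic", basic), ("align", align)) if v}
    params.update(page=page, limit=limit)
    return "/api/search_and_align?" + urlencode(params)


def iter_items(fetch, basic=None, align=None, limit=20):
    """Yield every item, reusing the session UUID after the first page."""
    first = fetch(search_url(basic=basic, align=align, limit=limit))
    yield from first["items"]
    seen, page = len(first["items"]), 1
    while seen < first["totalItems"]:
        page += 1
        data = fetch(search_url(session=first["session"], page=page, limit=limit))
        if not data["items"]:
            break  # stop rather than loop forever on an empty page
        yield from data["items"]
        seen += len(data["items"])


print(search_url(basic="Motor", align="QEEY"))
# /api/search_and_align?basic=Motor&align=QEEY&page=1&limit=20
```

Passing the session UUID instead of repeating `basic` and `align` lets later pages be served from the stored result set rather than from a fresh search and alignment.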
GET /api/nft_by_idcode
Description: Returns a single numeric `NFTID` for the given IDCODE, if it exists in the index.
Query Param: ?code=IDCODE
Example:
GET http://api.molnft.org/api/nft_by_idcode?code=101D
Sample Success Response:
{ "status": "success", "NFTID": "42" }
Possible Errors:
400 – If the `code` param is missing
404 – If the IDCODE is not found
In most cases, if an error occurs (e.g., invalid parameters, missing data, server error), the response is JSON in the form:
{ "status": "error", "message": "Description of the error" }
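On the client side, the error envelope can be checked in one place. The sketch below assumes every response body is JSON with a `status` field, as in the samples in this reference. `GlastError` and `unwrap` are hypothetical names.

```python
import json


class GlastError(Exception):
    """Raised when a response body carries "status": "error"."""


def unwrap(body):
    """Decode a JSON response body and raise if it reports an error."""
    data = json.loads(body)
    if data.get("status") == "error":
        raise GlastError(data.get("message", "unknown error"))
    return data


ok = unwrap('{"status": "success", "NFTID": "42"}')
print(ok["NFTID"])  # 42

try:
    unwrap('{"status": "error", "message": "Description of the error"}')
except GlastError as exc:
    print(f"request failed: {exc}")  # request failed: Description of the error
```

Note that with `urllib`, a 400 or 404 status raises `urllib.error.HTTPError` before any body is parsed. Its body can be read from the exception and passed through the same function, provided the server returned JSON.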
This API is under continuous development. For questions or support, please contact the MolNFT team.
Version: 1.0