Sunday, June 15, 2025
Cyber Defense GO

A Step-by-Step Guide to Building a Semantic Search Engine with Sentence Transformers, FAISS, and all-MiniLM-L6-v2

by Md Sazzad Hossain


Semantic search goes beyond traditional keyword matching by understanding the contextual meaning of search queries. Instead of simply matching exact words, semantic search systems capture the intent and contextual definition of the query and return relevant results even when they don’t contain the same keywords.

In this tutorial, we’ll implement a semantic search system using Sentence Transformers, a powerful library built on top of Hugging Face’s Transformers that provides pre-trained models specifically optimized for generating sentence embeddings. These embeddings are numerical representations of text that capture semantic meaning, allowing us to find similar content through vector similarity. We’ll create a practical application: a semantic search engine for a collection of scientific abstracts that can answer research queries with relevant papers, even when the terminology differs between the query and the relevant documents.
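
To make "vector similarity" concrete, here is a minimal sketch using cosine similarity on three tiny vectors. The vectors and the helper name `cosine_similarity` are invented for illustration; real embeddings from all-MiniLM-L6-v2 have 384 dimensions, not 3.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings", made up for this example only.
query_vec = np.array([0.9, 0.1, 0.0])
related_doc = np.array([0.8, 0.2, 0.1])
unrelated_doc = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(query_vec, related_doc))    # close to 1
print(cosine_similarity(query_vec, unrelated_doc))  # close to 0
```

Texts with similar meaning should end up pointing in similar directions, so the related pair scores much higher than the unrelated one.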

First, let’s install the necessary libraries in our Colab notebook:

!pip install sentence-transformers faiss-cpu numpy pandas matplotlib datasets

Now, let’s import the libraries we’ll need:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
import faiss
from typing import List, Dict, Tuple
import time
import re
import torch

For our demonstration, we’ll use a collection of scientific paper abstracts. Let’s create a small dataset of abstracts from various fields:

abstracts = [
    {
        "id": 1,
        "title": "Deep Learning for Natural Language Processing",
        "abstract": "This paper explores recent advances in deep learning models for natural language processing tasks. We review transformer architectures including BERT, GPT, and T5, and analyze their performance on various benchmarks including question answering, sentiment analysis, and text classification."
    },
    {
        "id": 2,
        "title": "Climate Change Impact on Marine Ecosystems",
        "abstract": "Rising ocean temperatures and acidification are severely impacting coral reefs and marine biodiversity. This study presents data collected over a 10-year period, demonstrating accelerated decline in reef ecosystems and proposing conservation strategies to mitigate further damage."
    },
    {
        "id": 3,
        "title": "Advancements in mRNA Vaccine Technology",
        "abstract": "The development of mRNA vaccines represents a breakthrough in immunization technology. This review discusses the mechanism of action, stability improvements, and clinical efficacy of mRNA platforms, with special attention to their rapid deployment during the COVID-19 pandemic."
    },
    {
        "id": 4,
        "title": "Quantum Computing Algorithms for Optimization Problems",
        "abstract": "Quantum computing offers potential speedups for solving complex optimization problems. This paper presents quantum algorithms for combinatorial optimization and compares their theoretical performance with classical methods on problems including traveling salesman and maximum cut."
    },
    {
        "id": 5,
        "title": "Sustainable Urban Planning Frameworks",
        "abstract": "This research proposes frameworks for sustainable urban development that integrate renewable energy systems, efficient public transportation networks, and green infrastructure. Case studies from five cities demonstrate reductions in carbon emissions and improvements in quality of life metrics."
    },
    {
        "id": 6,
        "title": "Neural Networks for Computer Vision",
        "abstract": "Convolutional neural networks have revolutionized computer vision tasks. This paper examines recent architectural innovations including residual connections, attention mechanisms, and vision transformers, evaluating their performance on image classification, object detection, and segmentation benchmarks."
    },
    {
        "id": 7,
        "title": "Blockchain Applications in Supply Chain Management",
        "abstract": "Blockchain technology enables transparent and secure tracking of goods throughout supply chains. This study analyzes implementations across food, pharmaceutical, and retail industries, quantifying improvements in traceability, reduction in counterfeit products, and enhanced consumer trust."
    },
    {
        "id": 8,
        "title": "Genetic Factors in Autoimmune Disorders",
        "abstract": "This research identifies key genetic markers associated with increased susceptibility to autoimmune conditions. Through genome-wide association studies of 15,000 patients, we identified novel variants that influence immune system regulation and may serve as targets for personalized therapeutic approaches."
    },
    {
        "id": 9,
        "title": "Reinforcement Learning for Robotic Control Systems",
        "abstract": "Deep reinforcement learning enables robots to learn complex manipulation tasks through trial and error. This paper presents a framework that combines model-based planning with policy gradient methods to achieve sample-efficient learning of dexterous manipulation skills."
    },
    {
        "id": 10,
        "title": "Microplastic Pollution in Freshwater Systems",
        "abstract": "This study quantifies microplastic contamination across 30 freshwater lakes and rivers, identifying primary sources and transport mechanisms. Results indicate correlation between population density and contamination levels, with implications for water treatment policies and plastic waste management."
    }
]


papers_df = pd.DataFrame(abstracts)
print(f"Dataset loaded with {len(papers_df)} scientific papers")
papers_df[["id", "title"]]

Now we’ll load a pre-trained Sentence Transformer model from Hugging Face. We’ll use the all-MiniLM-L6-v2 model, a compact model that produces 384-dimensional embeddings and provides a good balance between quality and speed:

model_name = "all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)
print(f"Loaded model: {model_name}")

Next, we’ll convert our text abstracts into dense vector embeddings:

documents = papers_df['abstract'].tolist()
document_embeddings = model.encode(documents, show_progress_bar=True)

print(f"Generated {len(document_embeddings)} embeddings with dimension {document_embeddings.shape[1]}")
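
For intuition about what the encoder produces: as far as we understand the all-MiniLM-L6-v2 pipeline, it averages the per-token vectors of a text (ignoring padding) and scales the result to unit length. The sketch below reproduces only that pooling step in NumPy on made-up token vectors; `mean_pool`, `tokens`, and `mask` are hypothetical names, not part of the Sentence Transformers API.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padded positions, then scale to unit length."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Made-up input: 1 sentence, 3 token slots (the last one is padding), 2 dimensions.
tokens = np.array([[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]])
mask = np.array([[1, 1, 0]])

sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)                  # the padded token does not affect the result
print(np.linalg.norm(sentence_vec))  # unit length
```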

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search. We’ll use it to index our document embeddings:

dimension = document_embeddings.shape[1]


index = faiss.IndexFlatL2(dimension)
index.add(np.array(document_embeddings).astype('float32'))


print(f"Created FAISS index with {index.ntotal} vectors")
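
A flat L2 index performs an exhaustive comparison: every query is compared against every stored vector, and the smallest squared Euclidean distances win. The NumPy sketch below mimics that behavior on toy 2-dimensional vectors so the numbers are easy to check by hand. The function `flat_l2_search` and the `toy_*` variables are illustrative names, not FAISS API.

```python
import numpy as np

def flat_l2_search(doc_vectors: np.ndarray, query_vectors: np.ndarray, k: int):
    """Exhaustive search: squared L2 distance from each query to every document."""
    diffs = query_vectors[:, None, :] - doc_vectors[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    nearest = np.argsort(sq_dists, axis=1)[:, :k]  # indices of the k closest docs
    return np.take_along_axis(sq_dists, nearest, axis=1), nearest

# Toy data chosen for easy mental arithmetic.
toy_docs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]], dtype="float32")
toy_query = np.array([[0.9, 0.1]], dtype="float32")

toy_distances, toy_indices = flat_l2_search(toy_docs, toy_query, k=2)
print(toy_indices)  # [[1 0]]
```

This brute-force approach is perfectly adequate for ten abstracts. For collections with millions of vectors, FAISS also offers approximate index types that trade a little accuracy for speed.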

Now let’s implement a function that takes a query, converts it to an embedding, and retrieves the most similar documents:

def semantic_search(query: str, top_k: int = 3) -> List[Dict]:
    """
    Search for documents similar to the query.

    Args:
        query: Text to search for
        top_k: Number of results to return

    Returns:
        List of dictionaries containing document info and similarity score
    """
    # Embed the query with the same model used for the documents
    query_embedding = model.encode([query])

    # FAISS expects float32 input and returns distances and row indices
    distances, indices = index.search(np.array(query_embedding).astype('float32'), top_k)

    results = []
    for i, idx in enumerate(indices[0]):
        results.append({
            'id': papers_df.iloc[idx]['id'],
            'title': papers_df.iloc[idx]['title'],
            'abstract': papers_df.iloc[idx]['abstract'],
            # IndexFlatL2 returns squared L2 distances; for unit-length
            # embeddings, 1 - d/2 equals the cosine similarity
            'similarity_score': 1 - distances[0][i] / 2
        })

    return results
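
The conversion from distance to similarity score rests on an identity for unit-length vectors: the squared L2 distance between a and b equals 2 - 2·cos(a, b), so 1 - d/2 recovers the cosine similarity. This assumes the embeddings are normalized, which is worth verifying for your model with `np.linalg.norm(document_embeddings, axis=1)`. A quick numerical check on random unit vectors (stand-ins for real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random 384-dimensional vectors, scaled to unit length.
a = rng.normal(size=384)
b = rng.normal(size=384)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(np.dot(a, b))
score = 1 - squared_l2 / 2

print(abs(score - cosine) < 1e-9)  # True
```

If the embeddings were not unit-length, the score would no longer be a cosine similarity and could fall outside the range [-1, 1].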

Let’s test our semantic search with various queries that demonstrate its ability to understand meaning beyond exact keywords:

test_queries = [
    "How do transformers work in natural language processing?",
    "What are the effects of global warming on ocean life?",
    "Tell me about COVID vaccine development",
    "Latest algorithms in quantum computing",
    "How can cities reduce their carbon footprint?"
]


for query in test_queries:
    print("\n" + "="*80)
    print(f"Query: {query}")
    print("="*80)

    results = semantic_search(query, top_k=3)

    for i, result in enumerate(results):
        print(f"\nResult #{i+1} (Score: {result['similarity_score']:.4f}):")
        print(f"Title: {result['title']}")
        print(f"Abstract snippet: {result['abstract'][:150]}...")
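
To see why these queries are hard for plain keyword matching, consider a naive baseline that compares content words. The sketch below checks the second test query against the title of the marine ecosystems paper; the stopword list and the helper `content_words` are simplifications invented for this example.

```python
import re

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"what", "are", "the", "of", "on", "in", "a", "and", "how", "do"}

def content_words(text: str) -> set:
    """Lowercase the text, split it into words, and drop common stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

query_text = "What are the effects of global warming on ocean life?"
paper_title = "Climate Change Impact on Marine Ecosystems"

shared = content_words(query_text) & content_words(paper_title)
print(shared)  # set()
```

The query and the title share no content words at all, yet they describe the same topic. An embedding-based search can still relate "global warming" to "climate change" and "ocean life" to "marine ecosystems".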

Let’s visualize the document embeddings to see how they cluster by topic:

from sklearn.decomposition import PCA


pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(document_embeddings)


plt.figure(figsize=(12, 8))
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], s=100, alpha=0.7)


for i, (x, y) in enumerate(reduced_embeddings):
    plt.annotate(papers_df.iloc[i]['title'][:20] + "...",
                 (x, y),
                 fontsize=9,
                 alpha=0.8)


plt.title('Document Embeddings Visualization (PCA)')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.grid(True, linestyle="--", alpha=0.7)
plt.tight_layout()
plt.show()
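
Keep in mind that the plot shows only 2 of the 384 dimensions, so distances on the chart are an approximation. The sketch below shows what a PCA projection does, using an SVD in NumPy on random stand-in data, and reports how much of the total variance the two components retain. The function `pca_2d` and the variable `fake_embeddings` are illustrative; with scikit-learn, the same ratio is available as `pca.explained_variance_ratio_`.

```python
import numpy as np

def pca_2d(vectors: np.ndarray):
    """Project vectors onto their top two principal components via SVD."""
    centered = vectors - vectors.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:2].T
    explained_ratio = (singular_values ** 2) / np.sum(singular_values ** 2)
    return projected, explained_ratio[:2]

# Random stand-in for 10 document embeddings of dimension 384.
rng = np.random.default_rng(42)
fake_embeddings = rng.normal(size=(10, 384))

points, ratio = pca_2d(fake_embeddings)
print(points.shape)  # (10, 2)
print(ratio.sum())   # share of variance kept by the two components
```

A low retained share means that documents appearing close together in the plot may still be far apart in the full embedding space.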

Let’s create a more interactive search interface:

from IPython.display import display, HTML, clear_output
import ipywidgets as widgets


def run_search(query_text):
    clear_output(wait=True)

    display(HTML(f"<h3>Query: {query_text}</h3>"))

    # Time the search so we can report latency alongside the results
    start_time = time.time()
    results = semantic_search(query_text, top_k=5)
    search_time = time.time() - start_time

    display(HTML(f"<p>Found {len(results)} results in {search_time:.4f} seconds</p>"))

    for i, result in enumerate(results):
        html = f"""
        <div>
            <h4>{i+1}. {result['title']} (Score: {result['similarity_score']:.4f})</h4>
            <p>{result['abstract']}</p>
        </div>
        """
        display(HTML(html))


search_box = widgets.Text(
    value="",
    placeholder="Type your search query here...",
    description='Search:',
    layout=widgets.Layout(width="70%")
)

search_button = widgets.Button(
    description='Search',
    button_style="primary",
    tooltip='Click to search'
)


def on_button_clicked(b):
    run_search(search_box.value)


search_button.on_click(on_button_clicked)
display(widgets.HBox([search_box, search_button]))

In this tutorial, we’ve built a complete semantic search system using Sentence Transformers. This system can understand the meaning behind user queries and return relevant documents even when there isn’t an exact keyword match. We’ve seen how embedding-based search can surface relevant results that traditional keyword matching would miss.




Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
