Building an AI-Powered Search Engine Using LangChain and Agents


The rise of Generative AI has transformed how we interact with information. Large Language Models (LLMs) like Llama3-8b-8192 can process complex queries, but they lack real-time search capabilities. To address this, I built an AI-powered search engine using LangChain Agents, integrating Wikipedia, Arxiv, and DuckDuckGo search tools to provide real-time, context-aware responses.

This project not only improves search accuracy but also showcases the power of Retrieval-Augmented Generation (RAG)—a technique that enhances LLMs by fetching live information before generating responses.


💡 How This Search Engine Works

🔹 Core Components Used

This project integrates multiple AI and search technologies:

  • LangChain → framework for building AI applications with LLMs
  • Streamlit → web interface for interactive AI search
  • DuckDuckGo API → fetches real-time web results
  • Wikipedia API → retrieves encyclopedic knowledge
  • Arxiv API → fetches academic research papers
  • FAISS → stores vectorized text for efficient retrieval
  • Hugging Face Transformers → embedding models for RAG
  • MySQL + SQLAlchemy → stores search logs for analytics

🔹 Tech Stack & Dependencies

To build this project, I used the following Python libraries:

langchain, langchain-community, langchain-openai, langchain-groq, langchain_huggingface
streamlit, python-dotenv, pypdf, arxiv, wikipedia, sentence_transformers, faiss-cpu
chromadb, duckdb, pandas, mysql-connector-python, SQLAlchemy, validators, pytube

These tools help integrate LLMs, search APIs, document retrieval, vector storage, and database management.


🚀 Project Overview: How It Works

Step 1: Setting Up Search Tools (Wikipedia, Arxiv, and DuckDuckGo)

To retrieve live data, we use the Wikipedia API, the Arxiv API, and DuckDuckGo search:

from langchain_community.utilities import ArxivAPIWrapper, WikipediaAPIWrapper
from langchain_community.tools import ArxivQueryRun, WikipediaQueryRun, DuckDuckGoSearchRun

# Setup Wikipedia tool
wiki_api_wrapper = WikipediaAPIWrapper(top_k_results=1, doc_content_chars_max=200)
wiki = WikipediaQueryRun(api_wrapper=wiki_api_wrapper)

# Setup Arxiv tool
arxiv_wrapper = ArxivAPIWrapper(top_k_results=1, doc_content_chars_max=200)
arxiv = ArxivQueryRun(api_wrapper=arxiv_wrapper)

# Setup DuckDuckGo web search tool
search = DuckDuckGoSearchRun(name="Search")

🔹 What This Does:

  • Wikipedia Tool → Fetches summarized encyclopedic content.
  • Arxiv Tool → Finds academic research papers.
  • DuckDuckGo Tool → Searches the live web for fresh content.
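Under the hood, each of these tools exposes a name, a description, and a run method, and the agent reads the descriptions when deciding which tool fits a query. Here is a minimal pure-Python sketch of that interface — the class and the stub outputs are illustrative, not LangChain's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SearchTool:
    """Minimal stand-in for a LangChain tool: a name, a description
    the agent reads when choosing a tool, and a function to call."""
    name: str
    description: str
    run: Callable[[str], str]

# Illustrative stubs; the real tools call the Wikipedia/Arxiv/DuckDuckGo APIs.
wiki_stub = SearchTool(
    name="wikipedia",
    description="Look up encyclopedic summaries for well-known topics.",
    run=lambda q: f"[wikipedia summary for: {q}]",
)
arxiv_stub = SearchTool(
    name="arxiv",
    description="Find academic research papers by topic or paper ID.",
    run=lambda q: f"[arxiv abstract for: {q}]",
)

print(wiki_stub.name, "->", wiki_stub.run("LangChain"))
```

This is why good tool descriptions matter: they are the only signal the agent has for routing a query.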

Step 2: Creating a Search Agent (The AI Brain)

We need an Agent that can think step by step and decide which tool to use.

from langchain_groq import ChatGroq
from langchain.agents import initialize_agent, AgentType

# Load API Key from .env file
import os
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv("GROQ_API_KEY")

# Initialize LLM Model (Llama3-8b-8192)
llm = ChatGroq(groq_api_key=api_key, model_name="Llama3-8b-8192", streaming=True)

# Define available tools
tools = [search, arxiv, wiki]

# Create an AI Agent that can decide which tool to use
search_agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, handle_parsing_errors=True)

🔹 What This Does:

  • Loads the AI model (Llama3-8b-8192) to process search queries.
  • Combines all search tools into one system.
  • Uses an Agent to decide whether to search Wikipedia, Arxiv, or the Web.
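The ZERO_SHOT_REACT_DESCRIPTION agent runs a reason–act loop: it picks a tool, calls it, observes the result, and repeats until it can answer. The selection step can be sketched as a toy router in plain Python — the keyword rules below are made up for illustration; the real agent asks the LLM to reason over the tool descriptions instead:

```python
def choose_tool(query: str) -> str:
    """Toy version of the agent's tool-selection step: route a query
    to a tool name based on simple keyword cues."""
    q = query.lower()
    if "paper" in q or "arxiv" in q:
        return "arxiv"       # academic queries go to Arxiv
    if "who is" in q or "history of" in q:
        return "wikipedia"   # encyclopedic queries go to Wikipedia
    return "duckduckgo"      # default: live web search

print(choose_tool("Find the paper on attention mechanisms"))  # arxiv
print(choose_tool("Who is Ada Lovelace?"))                    # wikipedia
print(choose_tool("Latest news on LangChain"))                # duckduckgo
```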

Step 3: Building an Interactive UI with Streamlit

To allow users to interact with the AI Search Engine, we built a Streamlit UI:

import streamlit as st
from langchain.callbacks import StreamlitCallbackHandler

# Streamlit UI Setup
st.title("🔎 AI-Powered Search Engine")

st.sidebar.title("Settings")
api_key = st.sidebar.text_input("Enter your Groq API Key:", type="password")

if "messages" not in st.session_state:
    st.session_state["messages"] = [{"role": "assistant", "content": "Hi! I can search the web. Ask me anything."}]

for msg in st.session_state.messages:
    st.chat_message(msg["role"]).write(msg["content"])

if prompt := st.chat_input("What do you want to search?"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    st.chat_message("user").write(prompt)

    # Invoke the AI Search Agent on the latest user prompt
    with st.chat_message("assistant"):
        st_cb = StreamlitCallbackHandler(st.container(), expand_new_thoughts=False)
        response = search_agent.run(prompt, callbacks=[st_cb])
        st.session_state.messages.append({"role": "assistant", "content": response})
        st.write(response)

🔹 What This Does:

  • Displays chat history between user and AI.
  • Allows real-time user input.
  • AI processes the query and returns results.
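One subtlety: Streamlit reruns the whole script on every interaction, so the chat history must live in st.session_state. The pattern reduces to an append-and-replay loop, sketched here in plain Python with a dict standing in for Streamlit's session state and a lambda standing in for the agent:

```python
# Plain-Python sketch of the session-state chat pattern used above.
session_state = {"messages": [
    {"role": "assistant", "content": "Hi! I can search the web. Ask me anything."}
]}

def handle_prompt(prompt: str, respond) -> None:
    """Append the user turn, compute a reply, append the assistant turn."""
    session_state["messages"].append({"role": "user", "content": prompt})
    session_state["messages"].append({"role": "assistant", "content": respond(prompt)})

handle_prompt("What is RAG?", respond=lambda p: f"[agent answer to: {p}]")
for msg in session_state["messages"]:
    print(f'{msg["role"]}: {msg["content"]}')
```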

🎯 Key Features & Benefits

Retrieval-Augmented Generation (RAG) → Combines AI with live search results.
Multi-Source Search → Wikipedia, Arxiv, and DuckDuckGo provide accurate, up-to-date answers.
AI Decision-Making → The Agent chooses the right tool for each query.
Fast & Scalable → Uses FAISS vector storage and LangChain agents.
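The FAISS/embedding side of RAG is listed in the stack but not shown in the snippets above; it boils down to nearest-neighbour search over vectors. Here is a dependency-free sketch using cosine similarity over toy 3-dimensional "embeddings" — in the real pipeline the vectors come from a Hugging Face embedding model and live in a FAISS index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy document "embeddings" (FAISS would index thousands of these).
docs = {
    "RAG combines retrieval with generation": [0.9, 0.1, 0.2],
    "FAISS indexes dense vectors": [0.2, 0.9, 0.1],
    "Streamlit builds web UIs": [0.1, 0.2, 0.9],
}

def retrieve(query_vec, k=1):
    """Return the k documents most similar to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]

print(retrieve([0.85, 0.15, 0.25]))  # closest to the RAG sentence
```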


🔗 Live Demo & GitHub Repository

  • GitHub Code: [Insert GitHub Link Here]
  • Live Demo: [Insert Deployed Link Here]

📌 Final Thoughts: What I Learned

🔹 AI alone is not enough → Real-time retrieval tools enhance accuracy.
🔹 LangChain makes it easy → Agents, Tools, and Executors simplify AI workflows.
🔹 Deployment matters → Hosting AI-powered search engines can help businesses make data-driven decisions.

🚀 Next Steps:
I plan to enhance this project by:
1️⃣ Adding PDF & YouTube Video Summarization 📑.
2️⃣ Improving the search accuracy with embeddings 🧠.
3️⃣ Deploying a full-scale API 🌍.

What do you think of this project? Feel free to share your feedback! 😊


🚀 Want to Build Your Own AI Search Engine?

If you’re interested in building your own AI-powered search tool, let’s connect on LinkedIn!

👉 [LinkedIn Profile]
👉 [GitHub Repository]

