The _ variable refers to the current context object. When accessing properties like _['url'], it's retrieving values from the input parameters passed to the task.
This step:
- Takes the input URL and crawls the website
- Processes the content into readable markdown format
- Chunks the content into manageable segments
- Filters out unnecessary elements like images and SVGs
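The chunk-size logic used later in the workflow can be reproduced in plain Python. This is a minimal sketch (the function name is illustrative, not part of the Julep API): the chunk size is the larger of reducing_strength and one ninth of the content length, which caps the output at roughly nine chunks.

```python
def chunk_content(content: list[str], reducing_strength: int) -> list[str]:
    # Chunk size: the larger of `reducing_strength` and len(content) // 9,
    # mirroring the expression in the task's `evaluate` step.
    size = max(reducing_strength, len(content) // 9)
    return [" ".join(content[i:i + size]) for i in range(0, len(content), size)]

# Example: 20 words with reducing_strength=5 -> four chunks of 5 words each
words = [f"w{i}" for i in range(20)]
chunks = chunk_content(words, reducing_strength=5)
```

A larger reducing_strength produces fewer, bigger chunks; the // 9 floor keeps very long documents from producing hundreds of tiny chunks.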
Step 2: Process and Index Content
```yaml
- evaluate:
    document: $ _.document
    chunks: |
      $ [" ".join(_.content[i:i + max(_.reducing_strength, len(_.content) // 9)])
        for i in range(0, len(_.content), max(_.reducing_strength, len(_.content) // 9))]
  label: docs

# Step 1: Process each content chunk in parallel
- over: $ [(steps[0].input.document, chunk.strip()) for chunk in _.chunks]
  parallelism: 3
  map:
    prompt:
      - role: user
        content: >-
          $ f'''
          <document>
          {_[0]}
          </document>
          Here is the chunk we want to situate within the whole document
          <chunk>
          {_[1]}
          </chunk>
          Please give a short succinct context to situate this chunk within the overall document
          for the purposes of improving search retrieval of the chunk.
          Answer only with the succinct context and nothing else.
          '''
    unwrap: true
    settings:
      max_tokens: 16000

# Step 2: Prepend the generated context to each chunk
- evaluate:
    final_chunks: |
      $ [
        NEWLINE.join([succint, chunk.strip()]) for chunk, succint in zip(steps['docs'].output.chunks, _)
      ]
```
This step:
- Processes each content chunk in parallel
- Generates contextual metadata for improved retrieval
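The effect of the final evaluate step can be sketched in plain Python: each model-generated context string is joined with its chunk, newline-separated, so the stored document carries both. Names here are illustrative, not part of the Julep API.

```python
NEWLINE = "\n"

def contextualize(chunks: list[str], contexts: list[str]) -> list[str]:
    # Prepend each generated context to its chunk, newline-separated,
    # mirroring the workflow's `final_chunks` expression.
    return [NEWLINE.join([ctx, chunk.strip()]) for chunk, ctx in zip(chunks, contexts)]

chunks = ["AI is a field of computer science. ", "It studies intelligent agents."]
contexts = ["Intro section of the article.", "Definition of the field."]
final_chunks = contextualize(chunks, contexts)
```

Storing the context alongside the chunk is what makes later search retrieval work better: a query can match the situating sentence even when it does not match the chunk's own wording.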
```yaml
# yaml-language-server: $schema=https://raw.githubusercontent.com/julep-ai/julep/refs/heads/dev/schemas/create_task_request.json
name: Julep Jina Crawler Task
description: A Julep agent that can crawl a website and store the content in the document store.

################################ INPUT SCHEMA #################################
input_schema:
  type: object
  properties:
    url:
      type: string
    reducing_strength:
      type: integer

#################################### TOOLS ####################################
tools:
- name: get_page
  type: api_call
  api_call:
    method: GET
    url: https://r.jina.ai/
    headers:
      accept: application/json
      x-return-format: markdown
      x-with-images-summary: "true"
      x-with-links-summary: "true"
      x-retain-images: "none"
      x-no-cache: "true"
      Authorization: "Bearer JINA_API_KEY"

- name: create_agent_doc
  description: Create an agent doc
  type: system
  system:
    resource: agent
    subresource: doc
    operation: create

########################### INDEX PAGE SUBWORKFLOW ############################
index_page:
# Step #0 - Evaluate the content
- evaluate:
    document: $ _.document
    chunks: |
      $ [" ".join(_.content[i:i + max(_.reducing_strength, len(_.content) // 9)])
        for i in range(0, len(_.content), max(_.reducing_strength, len(_.content) // 9))]
  label: docs

# Step #1 - Process each content chunk in parallel
- over: $ "[(steps[0].input.content, chunk) for chunk in _['documents']]"
  parallelism: 3
  map:
    prompt:
    - role: user
      content: >-
        $ f'''
        <document>
        {_[0]}
        </document>
        Here is the chunk we want to situate within the whole document
        <chunk>
        {_[1]}
        </chunk>
        Please give a short succinct context to situate this chunk within the overall document
        for the purposes of improving search retrieval of the chunk.
        Answer only with the succinct context and nothing else.'''
    unwrap: true
    settings:
      max_tokens: 16000

# Step #2 - Prepend the generated context to each chunk
- evaluate:
    final_chunks: |
      $ [
        NEWLINE.join([chunk, succint]) for chunk, succint in zip(steps[1].input.documents, _)
      ]

# Step #3 - Create a new document and add it to the agent docs store
- over: $ _['final_chunks']
  parallelism: 3
  map:
    tool: create_agent_doc
    arguments:
      agent_id: "00000000-0000-0000-0000-000000000000" # <--- This is the agent id of the agent you want to add the document to
      data:
        metadata:
          source: "jina_crawler"
        title: "Website Document"
        content: $ _

############################### MAIN WORKFLOW #################################
main:
# Step 0: Get the content of the page
- tool: get_page
  arguments:
    url: $ "https://r.jina.ai/" + steps[0].input.url

# Step 1: Chunk the content
- evaluate:
    result: $ chunk_doc(_.json.data.content.strip())

# Step 2: Pass the document chunks to the index_page subworkflow
- workflow: index_page
  arguments:
    content: $ _.result
    document: $ steps[0].output.json.data.content.strip()
    reducing_strength: $ steps[0].input.reducing_strength
```
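Before creating an execution, it can be useful to check a payload against the task's input_schema locally. A minimal stdlib-only sketch (the helper is illustrative; Julep validates inputs server-side against the schema):

```python
def validate_input(payload: dict) -> list[str]:
    # Hand-rolled check mirroring the task's input_schema:
    # url must be a string, reducing_strength an integer.
    errors = []
    if not isinstance(payload.get("url"), str):
        errors.append("url must be a string")
    if not isinstance(payload.get("reducing_strength"), int):
        errors.append("reducing_strength must be an integer")
    return errors
```

Catching a malformed payload locally avoids waiting for an execution to fail remotely.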
Start by creating an execution for the task. This execution will make the agent crawl the website and store the content in the document store.
```python
from julep import Client
import time
import yaml

# Initialize the client
client = Client(api_key=JULEP_API_KEY)

# Create the agent
agent = client.agents.create(
    name="Julep Jina Crawler Agent",
    about="A Julep agent that can crawl a website and store the content in the document store.",
)

# Load the task definition
with open('crawling_task.yaml', 'r') as file:
    task_definition = yaml.safe_load(file)

# Create the task
task = client.tasks.create(
    agent_id=agent.id,
    **task_definition
)

# Create the execution
execution = client.executions.create(
    task_id=task.id,
    input={
        "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
        "reducing_strength": 5,
    },
)

# Wait for the execution to complete
while (result := client.executions.get(execution.id)).status not in ['succeeded', 'failed']:
    print(result.status)
    time.sleep(1)

if result.status == "succeeded":
    print(result.output)
else:
    print(f"Error: {result.error}")
```
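The polling loop above waits once a second with no upper bound. A small helper with a timeout makes the wait safer; this is a sketch with an illustrative function name, not part of the Julep SDK, and the stubbed status source stands in for client.executions.get(execution.id).status.

```python
import time

def wait_for(get_status, timeout=300.0, interval=1.0):
    # Poll `get_status` (a zero-argument callable returning a status string)
    # until it reports a terminal state or `timeout` seconds elapse.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("succeeded", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError("execution did not finish in time")

# Example with a stubbed status source; with a real client this would be
# wait_for(lambda: client.executions.get(execution.id).status)
states = iter(["queued", "running", "succeeded"])
final = wait_for(lambda: next(states), interval=0.0)
```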
Next, create a session for the agent. This session will be used to chat with the agent.