Overview

This tutorial demonstrates how to:

  • Set up a web crawler using Julep’s Spider integration
  • Process and store crawled content in a document store
  • Implement RAG for enhanced AI responses
  • Create an intelligent agent that can answer questions about crawled content

Task Structure

Let’s break down the task into its core components:

1. Input Schema

First, we define what inputs our task expects:

input_schema:
    type: object
    properties:
        url:
            type: string
        pages_limit:
            type: integer

This schema specifies that our task expects:

  • A URL string (e.g., “https://julep.ai/”)
  • Number of pages to crawl (e.g., 5)
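
For reference, an input matching this schema can be checked locally with the jsonschema library. This is a minimal sketch with sample values; Julep itself validates inputs server-side, so the local check is only illustrative:

from jsonschema import validate

input_schema = {
    "type": "object",
    "properties": {
        "url": {"type": "string"},
        "pages_limit": {"type": "integer"},
    },
}

# Raises jsonschema.ValidationError if the input does not match the schema.
validate(
    instance={"url": "https://julep.ai/", "pages_limit": 5},
    schema=input_schema,
)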

2. Tools Configuration

Next, we define the external tools our task will use:

- name: spider_crawler
  type: integration
  integration:
    provider: spider
    method: crawl
    setup:
      spider_api_key: YOUR_SPIDER_API_KEY

- name: create_agent_doc
  type: system
  system:
    resource: agent
    subresource: doc
    operation: create

We’re using two tools:

  • The spider_crawler integration for web crawling
  • The create_agent_doc system tool for storing processed content
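
If you prefer not to hard-code YOUR_SPIDER_API_KEY in the YAML, the same tools block can be assembled in Python and merged into the task definition before it is created. This sketch assumes the key is available in a SPIDER_API_KEY environment variable:

import os

# The same two tools as above, built as Python dicts with the Spider API key
# injected from an environment variable instead of being hard-coded in YAML.
tools = [
    {
        "name": "spider_crawler",
        "type": "integration",
        "integration": {
            "provider": "spider",
            "method": "crawl",
            "setup": {"spider_api_key": os.environ["SPIDER_API_KEY"]},
        },
    },
    {
        "name": "create_agent_doc",
        "type": "system",
        "system": {"resource": "agent", "subresource": "doc", "operation": "create"},
    },
]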

3. Main Workflow Steps

Step 1: Crawl Website

- tool: spider_crawler
  arguments:
    url: $ _['url']
    params:
      request: "smart_mode"
      limit: $ _['pages_limit']
      return_format: "markdown"
      proxy_enabled: "True"
      filter_output_images: "True"
      filter_output_svg: "True"
      readability: "True"
      sitemap: "True"
      chunking_alg:
        type: "bysentence"
        value: "15"

This step:

  • Takes the input URL and crawls the website
  • Processes content into readable markdown format
  • Chunks content into manageable segments
  • Filters out unnecessary elements like images and SVGs
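
To make the chunking_alg setting concrete, here is a rough Python illustration of what chunking "bysentence" with a value of 15 means: the page text is split into sentences, and every 15 sentences become one chunk. This approximates Spider's behavior and is not its actual implementation:

import re

def chunk_by_sentence(text: str, sentences_per_chunk: int = 15) -> list[str]:
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Group every `sentences_per_chunk` sentences into one chunk.
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]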

Step 2: Process and Index Content

- evaluate:
    documents: $ _['content']

- over: $ "[(steps[0].input.content, chunk) for chunk in _['documents']]"
  parallelism: 3
  map:
    prompt:
    - role: user
      content: >-
        $ f'''
        <document>
        {_[0]}
        </document>

        <chunk>
        {_[1]}
        </chunk>

        Please give a short succinct context to situate this chunk within the overall document.'''

This step:

  • Processes each content chunk in parallel
  • Generates contextual metadata for improved retrieval
  • Prepares content for storage
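
In plain Python, the map step above corresponds roughly to the sketch below, where full_document stands in for steps[0].input.content and chunks stands in for _['documents']; both sample values are hypothetical:

# Hypothetical stand-ins for the workflow's expressions.
full_document = "Full markdown content of the crawled page..."
chunks = ["First chunk of the page...", "Second chunk of the page..."]

def build_context_prompt(document: str, chunk: str) -> str:
    # Mirrors the prompt template used in the map step.
    return f"""
<document>
{document}
</document>

<chunk>
{chunk}
</chunk>

Please give a short succinct context to situate this chunk within the overall document."""

# One (document, chunk) prompt per chunk; the workflow sends these to the model
# with a parallelism of 3.
prompts = [build_context_prompt(full_document, chunk) for chunk in chunks]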

Step 3: Store Documents

- over: $ _['final_chunks']
  parallelism: 3
  map:
    tool: create_agent_doc
    arguments:
      agent_id: $ agent.id
      data:
        metadata:
          source: spider_crawler
        title: Website Document
        content: $ _

This step:

  • Stores processed content in the document store
  • Adds metadata for source tracking
  • Creates searchable documents for RAG
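
Outside of the workflow, storing a single chunk looks roughly like the following. This is a sketch that assumes the Julep SDK's client.agents.docs.create method (used in other Julep examples) accepts the same title, content, and metadata fields as the create_agent_doc system tool:

from julep import Client

client = Client(api_key="YOUR_JULEP_API_KEY")

# Hypothetical contextualized chunk; the workflow stores one document per chunk.
chunk_text = "Julep is a platform for building AI agents. This chunk covers..."

doc = client.agents.docs.create(
    agent_id="YOUR_AGENT_ID",
    title="Website Document",
    content=chunk_text,
    metadata={"source": "spider_crawler"},
)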

Usage

Start by creating an execution for the task. This execution will make the agent crawl the website and store the content in the document store.

from julep import Client
import time
import yaml

# Initialize the Julep client (JULEP_API_KEY holds your Julep API key)
client = Client(api_key=JULEP_API_KEY)

# Create the agent
agent = client.agents.create(
  name="Julep Crawler Agent",
  description="A Julep agent that can crawl a website and store the content in the document store.",
)

# Load the task definition
with open('crawling_task.yaml', 'r') as file:
  task_definition = yaml.safe_load(file)

# Create the task
task = client.tasks.create(
  agent_id=agent.id,
  **task_definition
)

# Create the execution
execution = client.executions.create(
  task_id=task.id,
  input={"url": "https://julep.ai/", "pages_limit": 5}
)
# Wait for the execution to complete
while (result := client.executions.get(execution.id)).status not in ['succeeded', 'failed']:
    print(result.status)
    time.sleep(1)

if result.status == "succeeded":
    print(result.output)
else:
    print(f"Error: {result.error}")

Next, create a session for the agent. This session will be used to chat with the agent.

session = client.sessions.create(
  agent_id=agent.id
)

Finally, chat with the agent.

response = client.sessions.chat(
  session_id=session.id,
  messages=[
    {
      "role": "user",
      "content": "What is Julep?"
    }
  ]
)

print(response)
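
If you only want the generated text rather than the full response object, it can typically be read off the first choice, assuming the chat response follows Julep's usual OpenAI-style shape:

# Assumes the response exposes OpenAI-style choices with a message object.
print(response.choices[0].message.content)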

Example Output

This is an example output when the agent is asked “What is Julep?”

Next Steps

  • Try this task yourself
  • Check out the full example
  • See the Customer Support Chatbot using RAG cookbook