The _ variable refers to the current context object. When accessing properties like _['url'], it's retrieving values from the input parameters passed to the task.
This step:
- Takes the input URL and crawls the website
- Processes the content into readable markdown format
- Chunks the content into manageable segments
- Filters out unnecessary elements like images and SVGs
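The chunk-size logic used later in the workflow can be reproduced in plain Python. This is a minimal sketch (the function name is illustrative, not part of the Julep API): the chunk size is the larger of reducing_strength and one ninth of the content length, which caps the output at roughly nine chunks.

```python
def chunk_content(content: list[str], reducing_strength: int) -> list[str]:
    # Chunk size: the larger of `reducing_strength` and len(content) // 9,
    # mirroring the expression in the task's `evaluate` step.
    size = max(reducing_strength, len(content) // 9)
    return [" ".join(content[i:i + size]) for i in range(0, len(content), size)]

# Example: 20 words with reducing_strength=5 -> four chunks of 5 words each
words = [f"w{i}" for i in range(20)]
chunks = chunk_content(words, reducing_strength=5)
```

A larger reducing_strength produces fewer, bigger chunks; the // 9 floor keeps very long documents from producing hundreds of tiny chunks.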
Step 2: Process and Index Content
```yaml
- evaluate:
    document: $ _.document
    chunks: |
      $ [" ".join(_.content[i:i + max(_.reducing_strength, len(_.content) // 9)])
        for i in range(0, len(_.content), max(_.reducing_strength, len(_.content) // 9))]
  label: docs

# Step 1: Process each content chunk in parallel
- over: $ [(steps[0].input.document, chunk.strip()) for chunk in _.chunks]
  parallelism: 3
  map:
    prompt:
      - role: user
        content: >-
          $ f'''
          <document>
          {_[0]}
          </document>
          Here is the chunk we want to situate within the whole document
          <chunk>
          {_[1]}
          </chunk>
          Please give a short succinct context to situate this chunk within the overall document
          for the purposes of improving search retrieval of the chunk.
          Answer only with the succinct context and nothing else.
          '''
    unwrap: true
    settings:
      max_tokens: 16000

# Step 2: Prepend the generated context to each chunk
- evaluate:
    final_chunks: |
      $ [
        NEWLINE.join([succint, chunk.strip()]) for chunk, succint in zip(steps['docs'].output.chunks, _)
      ]
```
This step:
- Processes each content chunk in parallel
- Generates contextual metadata for improved retrieval
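The effect of the final evaluate step can be sketched in plain Python: each model-generated context string is joined with its chunk, newline-separated, so the stored document carries both. Names here are illustrative, not part of the Julep API.

```python
NEWLINE = "\n"

def contextualize(chunks: list[str], contexts: list[str]) -> list[str]:
    # Prepend each generated context to its chunk, newline-separated,
    # mirroring the workflow's `final_chunks` expression.
    return [NEWLINE.join([ctx, chunk.strip()]) for chunk, ctx in zip(chunks, contexts)]

chunks = ["AI is a field of computer science. ", "It studies intelligent agents."]
contexts = ["Intro section of the article.", "Definition of the field."]
final_chunks = contextualize(chunks, contexts)
```

Storing the context alongside the chunk is what makes later search retrieval work better: a query can match the situating sentence even when it does not match the chunk's own wording.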
```yaml
# yaml-language-server: $schema=https://raw.githubusercontent.com/julep-ai/julep/refs/heads/dev/schemas/create_task_request.json
name: Julep Jina Crawler Task
description: A Julep agent that can crawl a website and store the content in the document store.

################################ INPUT SCHEMA #################################
input_schema:
  type: object
  properties:
    url:
      type: string
    reducing_strength:
      type: integer

#################################### TOOLS ####################################
tools:
- name: get_page
  type: api_call
  api_call:
    method: GET
    url: https://r.jina.ai/
    headers:
      accept: application/json
      x-return-format: markdown
      x-with-images-summary: "true"
      x-with-links-summary: "true"
      x-retain-images: "none"
      x-no-cache: "true"
      Authorization: "Bearer JINA_API_KEY"

- name: create_agent_doc
  description: Create an agent doc
  type: system
  system:
    resource: agent
    subresource: doc
    operation: create

########################### INDEX PAGE SUBWORKFLOW ############################
index_page:
# Step #0 - Evaluate the content
- evaluate:
    document: $ _.document
    chunks: |
      $ [" ".join(_.content[i:i + max(_.reducing_strength, len(_.content) // 9)])
        for i in range(0, len(_.content), max(_.reducing_strength, len(_.content) // 9))]
  label: docs

# Step #1 - Process each content chunk in parallel
- over: $ "[(steps[0].input.content, chunk) for chunk in _['documents']]"
  parallelism: 3
  map:
    prompt:
    - role: user
      content: >-
        $ f'''
        <document>
        {_[0]}
        </document>
        Here is the chunk we want to situate within the whole document
        <chunk>
        {_[1]}
        </chunk>
        Please give a short succinct context to situate this chunk within the overall document
        for the purposes of improving search retrieval of the chunk.
        Answer only with the succinct context and nothing else.'''
    unwrap: true
    settings:
      max_tokens: 16000

# Step #2 - Prepend the generated context to each chunk
- evaluate:
    final_chunks: |
      $ [
        NEWLINE.join([chunk, succint]) for chunk, succint in zip(steps[1].input.documents, _)
      ]

# Step #3 - Create a new document and add it to the agent docs store
- over: $ _['final_chunks']
  parallelism: 3
  map:
    tool: create_agent_doc
    arguments:
      agent_id: "00000000-0000-0000-0000-000000000000" # <--- This is the agent id of the agent you want to add the document to
      data:
        metadata:
          source: "jina_crawler"
        title: "Website Document"
        content: $ _

############################### MAIN WORKFLOW #################################
main:
# Step 0: Get the content of the page
- tool: get_page
  arguments:
    url: $ "https://r.jina.ai/" + steps[0].input.url

# Step 1: Chunk the content
- evaluate:
    result: $ chunk_doc(_.json.data.content.strip())

# Step 2: Pass the document chunks to the index_page subworkflow
- workflow: index_page
  arguments:
    content: $ _.result
    document: $ steps[0].output.json.data.content.strip()
    reducing_strength: $ steps[0].input.reducing_strength
```
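Before creating an execution, it can be useful to check a payload against the task's input_schema locally. A minimal stdlib-only sketch (the helper is illustrative; Julep validates inputs server-side against the schema):

```python
def validate_input(payload: dict) -> list[str]:
    # Hand-rolled check mirroring the task's input_schema:
    # url must be a string, reducing_strength an integer.
    errors = []
    if not isinstance(payload.get("url"), str):
        errors.append("url must be a string")
    if not isinstance(payload.get("reducing_strength"), int):
        errors.append("reducing_strength must be an integer")
    return errors
```

Catching a malformed payload locally avoids waiting for an execution to fail remotely.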
Start by creating an execution for the task. This execution will make the agent crawl the website and store the content in the document store.
```python
from julep import Client
import time
import yaml

# Initialize the client
client = Client(api_key=JULEP_API_KEY)

# Create the agent
agent = client.agents.create(
    name="Julep Jina Crawler Agent",
    about="A Julep agent that can crawl a website and store the content in the document store.",
)

# Load the task definition
with open('crawling_task.yaml', 'r') as file:
    task_definition = yaml.safe_load(file)

# Create the task
task = client.tasks.create(
    agent_id=agent.id,
    **task_definition
)

# Create the execution
execution = client.executions.create(
    task_id=task.id,
    input={
        "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
        "reducing_strength": 5,
    },
)

# Wait for the execution to complete
while (result := client.executions.get(execution.id)).status not in ['succeeded', 'failed']:
    print(result.status)
    time.sleep(1)

if result.status == "succeeded":
    print(result.output)
else:
    print(f"Error: {result.error}")
```
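The polling loop above waits once a second with no upper bound. A small helper with a timeout makes the wait safer; this is a sketch with an illustrative function name, not part of the Julep SDK, and the stubbed status source stands in for client.executions.get(execution.id).status.

```python
import time

def wait_for(get_status, timeout=300.0, interval=1.0):
    # Poll `get_status` (a zero-argument callable returning a status string)
    # until it reports a terminal state or `timeout` seconds elapse.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("succeeded", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError("execution did not finish in time")

# Example with a stubbed status source; with a real client this would be
# wait_for(lambda: client.executions.get(execution.id).status)
states = iter(["queued", "running", "succeeded"])
final = wait_for(lambda: next(states), interval=0.0)
```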
Next, create a session for the agent. This session will be used to chat with the agent.