Overview

Welcome to the Spider Crawler integration guide for Julep! This integration allows you to crawl websites and extract data, enabling you to build workflows that require web scraping capabilities. Whether you’re gathering data for analysis or monitoring web content, this guide will walk you through the setup and usage.

Prerequisites

To use the Spider integration, you need an API key. You can obtain this key by signing up at Spider.

How to Use the Integration

To get started with the Spider integration, follow these steps to configure and create a task:
1. Configure Your API Key

Add your API key to the tools section of your task. This will allow Julep to authenticate requests to Spider on your behalf.
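
Concretely, the key goes in the setup block of the tool definition, as in this excerpt from the full example in the next step:

tools:
- name: spider_tool
  type: integration
  integration:
    provider: spider
    setup:
      spider_api_key: "SPIDER_API_KEY"  # replace with your actual Spider API key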
2. Create Task Definition

Use the following YAML configuration to define your web crawling task:
Spider Example
name: Spider Task
tools:
- name: spider_tool
  type: integration
  integration:
    provider: spider
    method: crawl
    setup:
      spider_api_key: "SPIDER_API_KEY"
main:
- tool: spider_tool
  method: crawl
  arguments:
    url: $ _.url
    params: # Optional parameters
      key1: value1 # placeholder; replace with the actual Spider parameters
    content_type: application/json

YAML Explanation

  • name: A descriptive name for the task, in this case, “Spider Task”.
  • tools: This section lists the tools or integrations being used. Here, spider_tool is defined as an integration tool.
  • type: Specifies the type of tool, which is integration in this context.
  • integration: Details the provider and setup for the integration.
    • provider: Indicates the service provider, which is spider for Spider.
    • method: Specifies the method to use, such as crawl, links, screenshot, or search. Defaults to crawl if not specified.
    • setup: Contains configuration details, such as the API key (spider_api_key) required for authentication.
  • main: Defines the main execution steps.
    • tool: Refers to the tool defined earlier (spider_tool).
    • arguments: Specifies the input parameters for the tool:
      • url: The URL to fetch data from.
      • params: (optional) The parameters for the Spider API. Defaults to None.
      • content_type: (optional) The content type to return. Default is “application/json”. Other options: “text/csv”, “application/xml”, “application/jsonl”.
Remember to replace SPIDER_API_KEY with your actual API key. Customize the url, params, and content_type parameters to suit your specific needs.
The parameters available for each method are listed in the Spider API documentation.
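For example, the placeholder params from the task definition above could be filled in like this. Note that limit and return_format are assumed names for common Spider crawl options, not parameters confirmed by this guide; verify them against the Spider API documentation before use:

main:
- tool: spider_tool
  method: crawl
  arguments:
    url: $ _.url
    params:
      limit: 5                 # assumed parameter: maximum number of pages to crawl
      return_format: markdown  # assumed parameter: format of the returned page content
    content_type: application/json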
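The same pattern applies to the other methods listed in the explanation above. As a sketch, switching to the links method only requires changing the method field in both the tool definition and the step:

tools:
- name: spider_tool
  type: integration
  integration:
    provider: spider
    method: links
    setup:
      spider_api_key: "SPIDER_API_KEY"
main:
- tool: spider_tool
  method: links
  arguments:
    url: $ _.url
    content_type: application/json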

Conclusion

With the Spider integration, you can efficiently crawl websites and extract valuable data. This integration provides a robust solution for web scraping, enhancing your workflow’s capabilities and user experience.
For more information, please refer to the Spider API documentation.