For AI agents: a documentation index is available at https://docs.parallel.ai/llms.txt. The full text of all docs is at https://docs.parallel.ai/llms-full.txt. You may also fetch any page as Markdown by appending
This integration is ideal for data engineers who need to enrich large datasets with web intelligence directly in their Spark pipelines—without leaving SQL or building custom API integrations.
Parallel provides SQL-native User Defined Functions (UDFs) for Apache Spark that enable data enrichment directly in your SQL queries. The UDFs process rows concurrently within each partition for optimal performance.
.md to its URL or sending Accept: text/markdown.Features
- SQL-Native: Use
parallel_enrich()directly in Spark SQL queries - Concurrent Processing: All rows in each partition are processed concurrently using asyncio
- Configurable Processors: Choose from lite-fast to ultra for speed vs thoroughness tradeoffs
- Structured Output: Returns JSON that can be parsed with Spark’s
from_json()
Installation
Setup
- Get your API key from Parallel
- Register the UDFs with your Spark session:
Configuration Options
Basic Usage
Once registered, useparallel_enrich() in any SQL query:
UDF Parameters
| Parameter | Type | Description |
|---|---|---|
input_data | map<string, string> | Key-value pairs of input data for enrichment |
output_columns | array<string> | Descriptions of the columns you want to enrich |
Parsing Results
The UDF returns JSON strings. Field names are converted to snake_case (e.g., “CEO name” →ceo_name).
Use get_json_object() to extract individual fields:
from_json() with a schema for structured parsing:
Including Basis/Citations
To include source citations in your enrichment results, setinclude_basis=True:
_basis field with citations:
Processor Selection
Choose a processor based on your speed vs thoroughness requirements. See Choose a Processor for detailed guidance and Pricing for cost information. Use theparallel_enrich_with_processor UDF to override per query:
Best Practices
Partition sizing
Partition sizing
The UDF processes all rows in a partition concurrently. For optimal performance:
- Use
repartition()to control partition sizes - Aim for 10-100 rows per partition for balanced concurrency
Error handling
Error handling
Failed enrichments return JSON with an Filter these in your downstream processing.
error field:Rate limits
Rate limits
Concurrent processing respects Parallel’s rate limits. For large datasets, consider:
- Reducing partition sizes
- Using slower processors that have higher rate limits