Features
- SQL-Native: Use `parallel_enrich()` directly in Spark SQL queries
- Concurrent Processing: All rows in each partition are processed concurrently using asyncio
- Configurable Processors: Choose from `lite` (fastest) to `ultra` (most thorough) to trade speed against thoroughness
- Structured Output: Returns JSON that can be parsed with Spark's `from_json()`
Installation
Setup
- Get your API key from Parallel
- Register the UDFs with your Spark session:
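A minimal registration sketch, assuming a hypothetical `parallel_spark` module exposing a `register_parallel_udfs()` helper and an API key supplied via the `PARALLEL_API_KEY` environment variable (check the package's actual import path and option names):

```python
import os

from pyspark.sql import SparkSession

# Hypothetical import; the real module and helper names may differ.
from parallel_spark import register_parallel_udfs

spark = SparkSession.builder.appName("enrichment").getOrCreate()

# Registers parallel_enrich() and parallel_enrich_with_processor()
# as SQL functions on this session.
register_parallel_udfs(spark, api_key=os.environ["PARALLEL_API_KEY"])
```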
Configuration Options
Basic Usage
Once registered, use `parallel_enrich()` in any SQL query.
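A minimal sketch, assuming an illustrative `companies` table with a `company_name` column:

```python
# Ask for two enriched columns per company; the result is a JSON string.
enriched = spark.sql("""
    SELECT
        company_name,
        parallel_enrich(
            map('company', company_name),
            array('CEO name', 'Company headquarters city')
        ) AS enrichment
    FROM companies
""")
enriched.show(truncate=False)
```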
UDF Parameters
| Parameter | Type | Description |
|---|---|---|
| `input_data` | `map<string, string>` | Key-value pairs of input data for enrichment |
| `output_columns` | `array<string>` | Descriptions of the columns you want to enrich |
Parsing Results
The UDF returns JSON strings. Field names are converted to snake_case (e.g., “CEO name” → `ceo_name`).
Use `get_json_object()` to extract individual fields:
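A sketch, reusing the illustrative `enriched` DataFrame from above:

```python
enriched.createOrReplaceTempView("enriched_companies")

# Pull one snake_cased field out of the JSON string.
spark.sql("""
    SELECT
        company_name,
        get_json_object(enrichment, '$.ceo_name') AS ceo_name
    FROM enriched_companies
""").show()
```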
Or use `from_json()` with a schema for structured parsing:
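A sketch with a schema matching the snake_cased names of the output columns requested above:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

schema = StructType([
    StructField("ceo_name", StringType()),
    StructField("company_headquarters_city", StringType()),
])

# Parse the JSON string into a struct column, then flatten it.
parsed = enriched.withColumn("parsed", F.from_json("enrichment", schema))
parsed.select("company_name", "parsed.*").show(truncate=False)
```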
Including Basis/Citations
To include source citations in your enrichment results, set `include_basis=True`:
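One plausible wiring, assuming `include_basis` is an option on the hypothetical registration helper from the Setup section (the real mechanism may differ; check the package docs):

```python
# Hypothetical: enable basis/citations for all registered enrichment UDFs.
register_parallel_udfs(
    spark,
    api_key=os.environ["PARALLEL_API_KEY"],
    include_basis=True,
)
```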
The output then includes a `_basis` field with citations:
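For example, the basis can be extracted alongside the enriched fields; its inner structure is documented in the API reference, so it is kept here as a raw JSON string:

```python
from pyspark.sql import functions as F

with_basis = enriched.select(
    "company_name",
    F.get_json_object("enrichment", "$.ceo_name").alias("ceo_name"),
    # Citations for the enriched values, as an unparsed JSON string.
    F.get_json_object("enrichment", "$._basis").alias("basis_json"),
)
```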
Processor Selection
Choose a processor based on your speed vs. thoroughness requirements. See Choose a Processor for detailed guidance and Pricing for cost information. Use the `parallel_enrich_with_processor` UDF to override the processor per query:
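A sketch, assuming the processor name is passed as a trailing string argument (`'core'` is illustrative):

```python
spark.sql("""
    SELECT
        company_name,
        parallel_enrich_with_processor(
            map('company', company_name),
            array('CEO name'),
            'core'
        ) AS enrichment
    FROM companies
""")
```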
Best Practices
Partition sizing
The UDF processes all rows in a partition concurrently. For optimal performance:
- Use `repartition()` to control partition sizes
- Aim for 10-100 rows per partition for balanced concurrency (see the sketch below)
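A sketch of sizing partitions before enrichment; `companies_df` and the 50-rows-per-partition target are illustrative:

```python
# Each partition's rows are enriched concurrently, so partition size
# directly controls the degree of concurrent API calls.
target_rows_per_partition = 50
num_partitions = max(1, companies_df.count() // target_rows_per_partition)
companies_df = companies_df.repartition(num_partitions)
```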
Error handling
Failed enrichments return JSON with an `error` field. Filter these out in your downstream processing, as shown below.
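A sketch of splitting out failures, reusing the illustrative `enriched` DataFrame:

```python
from pyspark.sql import functions as F

has_error = F.get_json_object("enrichment", "$.error").isNotNull()

failed = enriched.filter(has_error)        # inspect or retry these rows
succeeded = enriched.filter(~has_error)    # safe to parse downstream
```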
Rate limits
Concurrent processing respects Parallel’s rate limits. For large datasets, consider:
- Reducing partition sizes
- Using slower processors that have higher rate limits