This integration is aimed at data engineers who need to enrich large datasets with web intelligence directly in their Spark pipelines, without leaving SQL or building custom API integrations.

Parallel provides SQL-native User Defined Functions (UDFs) for Apache Spark that enable data enrichment directly in your SQL queries. The UDFs process rows concurrently within each partition, so enrichment requests overlap rather than running one row at a time.
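To make the idea of per-partition concurrency concrete, here is a minimal sketch of how the rows of one partition could be enriched concurrently with `asyncio`. It is not Parallel's implementation: `enrich_row`, `enrich_partition`, and `max_concurrency` are hypothetical names, and the API call is simulated with a short sleep.

```python
import asyncio
from typing import Dict, Iterable, Iterator, List


async def enrich_row(row: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical stand-in for a single enrichment API call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {**row, "enriched": "true"}


async def _enrich_all(rows: List[Dict[str, str]], max_concurrency: int) -> List[Dict[str, str]]:
    # Cap the number of in-flight calls (assumed limit, not a documented setting).
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(row: Dict[str, str]) -> Dict[str, str]:
        async with semaphore:
            return await enrich_row(row)

    # gather() returns results in the same order as the input rows.
    return await asyncio.gather(*(bounded(row) for row in rows))


def enrich_partition(
    rows: Iterable[Dict[str, str]], max_concurrency: int = 8
) -> Iterator[Dict[str, str]]:
    """Takes an iterator of rows and returns an iterator, the shape mapPartitions expects."""
    return iter(asyncio.run(_enrich_all(list(rows), max_concurrency)))
```

With this shape, the wall-clock time for a partition is governed by the slowest batch of overlapping calls rather than by the sum of all call latencies.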
Once registered, use parallel_enrich() in any SQL query:
# Create sample data
spark.sql("""
    CREATE OR REPLACE TEMP VIEW companies AS
    SELECT 'Google' as company_name, 'https://google.com' as website
    UNION ALL
    SELECT 'Apple', 'https://apple.com'
""")

# Enrich with Parallel
result = spark.sql("""
    SELECT
        company_name,
        parallel_enrich(
            map('company_name', company_name, 'website', website),
            array('CEO name', 'company description', 'founding year')
        ) as enriched_data
    FROM companies
""")
result.show(truncate=False)
Output:
+------------+---------------------------------------------------------------------------------------------------------+
|company_name|enriched_data                                                                                            |
+------------+---------------------------------------------------------------------------------------------------------+
|Google      |{"ceo_name": "Sundar Pichai", "founding_year": "1998", "company_description": "Google is an American..."}|
|Apple       |{"ceo_name": "Tim Cook", "founding_year": "1976", "company_description": "Apple Inc. is an American..."} |
+------------+---------------------------------------------------------------------------------------------------------+
The UDF returns JSON strings. Field names are converted to snake_case (e.g., “CEO name” → ceo_name).

Use get_json_object() to extract individual fields:
from pyspark.sql.functions import get_json_object

result = spark.sql("""
    SELECT
        company_name,
        get_json_object(enriched_data, '$.ceo_name') as ceo,
        get_json_object(enriched_data, '$.founding_year') as founded
    FROM (
        SELECT
            company_name,
            parallel_enrich(
                map('company_name', company_name),
                array('CEO name', 'founding year')
            ) as enriched_data
        FROM companies
    )
""")
result.show()
Output:
+------------+-------------+-------+
|company_name|          ceo|founded|
+------------+-------------+-------+
|      Google|Sundar Pichai|   1998|
|       Apple|     Tim Cook|   1976|
+------------+-------------+-------+
Or use from_json() with a schema for structured parsing:
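A minimal sketch of the from_json() approach, assuming the `companies` view and the registered `parallel_enrich` UDF from the examples above. The schema is written as a Spark DDL string whose field names are the snake_case keys shown in the earlier output; both fields are typed as STRING because the JSON values are quoted. `ENRICHED_SCHEMA`, `FROM_JSON_QUERY`, and `parse_enriched` are illustrative names, not part of the integration.

```python
# DDL-style schema for the JSON returned by parallel_enrich.
# Assumption: the values arrive as JSON strings, as in the sample output.
ENRICHED_SCHEMA = "ceo_name STRING, founding_year STRING"

FROM_JSON_QUERY = f"""
    SELECT
        company_name,
        parsed.ceo_name AS ceo,
        parsed.founding_year AS founded
    FROM (
        SELECT
            company_name,
            from_json(
                parallel_enrich(
                    map('company_name', company_name),
                    array('CEO name', 'founding year')
                ),
                '{ENRICHED_SCHEMA}'
            ) AS parsed
        FROM companies
    )
"""


def parse_enriched(spark):
    """Run the query on an active SparkSession and return the resulting DataFrame."""
    return spark.sql(FROM_JSON_QUERY)


# Usage, given an active SparkSession named `spark`:
# result = parse_enriched(spark)
# result.show()
```

Compared with get_json_object(), this parses each JSON string once into a struct, so additional fields can be selected from `parsed` without re-parsing; fields missing from the JSON come back as NULL.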
Choose a processor based on your speed versus thoroughness requirements. See Choose a Processor for detailed guidance and Pricing for cost information.

Use the parallel_enrich_with_processor UDF to override per query: