Parallel provides a native Polars integration that enables DataFrame-native data enrichment with efficient batch processing. It is ideal for data scientists and engineers who work with Polars DataFrames and need to enrich data with web intelligence directly in their Python workflows.
A complete demo notebook is also available.

Features

  • DataFrame-Native: Enriched columns added directly to your Polars DataFrame
  • Batch Processing: All rows processed in a single API call for efficiency
  • LazyFrame Support: Works with both eager and lazy DataFrames
  • Partial Results: Failed rows return None without stopping the entire batch

Installation

pip install parallel-web-tools[polars]
Or with all dependencies:
pip install parallel-web-tools[all]

Basic Usage

import polars as pl
from parallel_web_tools.integrations.polars import parallel_enrich

# Create a DataFrame
df = pl.DataFrame({
    "company": ["Google", "Microsoft", "Apple"],
    "website": ["google.com", "microsoft.com", "apple.com"],
})

# Enrich with company information
result = parallel_enrich(
    df,
    input_columns={
        "company_name": "company",
        "website": "website",
    },
    output_columns=[
        "CEO name",
        "Founding year",
        "Headquarters city",
    ],
)

# Access the enriched DataFrame
print(result.result)
print(f"Success: {result.success_count}, Errors: {result.error_count}")
Output:
company    website        ceo_name       founding_year  headquarters_city
Google     google.com     Sundar Pichai  1998           Mountain View
Microsoft  microsoft.com  Satya Nadella  1975           Redmond
Apple      apple.com      Tim Cook       1976           Cupertino

Function Parameters

Parameter       Type            Default      Description
df              pl.DataFrame    required     DataFrame to enrich
input_columns   dict[str, str]  required     Mapping of input descriptions to column names
output_columns  list[str]       required     List of output column descriptions
api_key         str | None      None         API key (uses PARALLEL_API_KEY env var if not provided)
processor       str             "lite-fast"  Parallel processor to use
timeout         int             600          Timeout in seconds
include_basis   bool            False        Include citations in results
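
For example, any of the optional parameters can be overridden per call. A short sketch using the df from Basic Usage:

result = parallel_enrich(
    df,
    input_columns={"company_name": "company"},
    output_columns=["CEO name"],
    processor="core-fast",  # more thorough than the default "lite-fast"
    timeout=300,            # override the default 600-second timeout
    include_basis=True,     # attach citations (see "Including Citations" below)
)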

Return Value

The function returns an EnrichmentResult dataclass:
@dataclass
class EnrichmentResult:
    result: pl.DataFrame      # Enriched DataFrame
    success_count: int        # Number of successful rows
    error_count: int          # Number of failed rows
    errors: list[dict]        # Error details with row index
    elapsed_time: float       # Processing time in seconds
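
For example, the fields can be used to log a quick summary after a run (again using the df from Basic Usage):

result = parallel_enrich(
    df,
    input_columns={"company_name": "company"},
    output_columns=["CEO name"],
)
print(f"Enriched {result.success_count} rows in {result.elapsed_time:.1f}s")
if result.errors:
    print(result.errors[0])  # a dict with "row" and "error" keys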

Column Name Mapping

Output column descriptions are automatically converted to valid Python identifiers in snake_case:
Description               Column Name
"CEO name"                ceo_name
"Founding year (YYYY)"    founding_year
"Annual revenue [USD]"    annual_revenue

LazyFrame Support

Use parallel_enrich_lazy() to work with LazyFrames:
from parallel_web_tools.integrations.polars import parallel_enrich_lazy

# Read from CSV lazily
lf = pl.scan_csv("companies.csv")

# Filter and select
lf = lf.filter(pl.col("active")).select(["name", "website"])

# Enrich (will collect the LazyFrame)
result = parallel_enrich_lazy(
    lf,
    input_columns={"company_name": "name", "website": "website"},
    output_columns=["CEO name"],
)
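
Since parallel_enrich_lazy collects the LazyFrame, result.result is an eager pl.DataFrame; convert it back with .lazy() if you want to keep composing lazy operations:

enriched_lf = result.result.lazy().filter(pl.col("ceo_name").is_not_null())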

Including Citations

result = parallel_enrich(
    df,
    input_columns={"company_name": "company"},
    output_columns=["CEO name"],
    include_basis=True,
)

# Access citations in the _basis column
for row in result.result.iter_rows(named=True):
    print(f"CEO: {row['ceo_name']}")
    print(f"Sources: {row['_basis']}")

Processor Selection

Choose a processor based on your speed vs thoroughness requirements. See Choose a Processor for detailed guidance and Pricing for cost information.
Processor  Speed    Best For
lite-fast  Fastest  Basic metadata, high volume
base-fast  Fast     Standard enrichments
core-fast  Medium   Cross-referenced data
pro-fast   Slower   Deep research

Best Practices

Be specific in your output column descriptions for better results:
output_columns = [
    "CEO name (current CEO or equivalent leader)",
    "Founding year (YYYY format)",
    "Annual revenue (USD, most recent fiscal year)",
]
Errors don't stop processing; partial results are returned:
result = parallel_enrich(df, ...)

if result.error_count > 0:
    print(f"Failed rows: {result.error_count}")
    for error in result.errors:
        print(f"  Row {error['row']}: {error['error']}")

# Filter successful rows
successful_df = result.result.filter(pl.col("ceo_name").is_not_null())
For very large datasets (1000+ rows), consider processing in batches:
def enrich_in_batches(df: pl.DataFrame, batch_size: int = 100):
    results = []
    for i in range(0, len(df), batch_size):
        batch = df.slice(i, batch_size)
        result = parallel_enrich(batch, ...)
        results.append(result.result)
    return pl.concat(results)
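
Because each call returns its own EnrichmentResult, aggregate statistics can be accumulated as you go. A sketch assuming the same columns as Basic Usage:

def enrich_in_batches_with_stats(df: pl.DataFrame, batch_size: int = 100):
    frames, total_errors = [], 0
    for i in range(0, len(df), batch_size):
        result = parallel_enrich(
            df.slice(i, batch_size),
            input_columns={"company_name": "company", "website": "website"},
            output_columns=["CEO name"],
        )
        frames.append(result.result)        # enriched DataFrame for this batch
        total_errors += result.error_count  # accumulate failures across batches
    return pl.concat(frames), total_errors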
  • Use lite-fast for high-volume, basic enrichments
  • Test with small batches before processing large DataFrames
  • Store results to avoid re-enriching the same data