
Parallel provides a native Polars integration for DataFrame-native data enrichment, with batch processing for efficiency. It is ideal for data scientists and engineers who work with Polars DataFrames and need to enrich data with web intelligence directly in their Python workflows.

Features

  • DataFrame-Native: Enriched columns added directly to your Polars DataFrame
  • Batch Processing: All rows processed in a single API call for efficiency
  • LazyFrame Support: Works with both eager and lazy DataFrames
  • Partial Results: Failed rows return None without stopping the entire batch

Installation

pip install parallel-web-tools[polars]
Or with all dependencies:
pip install parallel-web-tools[all]

Basic Usage

import polars as pl
from parallel_web_tools.integrations.polars import parallel_enrich

# Create a DataFrame
df = pl.DataFrame({
    "company": ["Google", "Microsoft", "Apple"],
    "website": ["google.com", "microsoft.com", "apple.com"],
})

# Enrich with company information
result = parallel_enrich(
    df,
    input_columns={
        "company_name": "company",
        "website": "website",
    },
    output_columns=[
        "CEO name",
        "Founding year",
        "Headquarters city",
    ],
)

# Access the enriched DataFrame
print(result.result)
print(f"Success: {result.success_count}, Errors: {result.error_count}")
Output:
| company | website | ceo_name | founding_year | headquarters_city |
| --- | --- | --- | --- | --- |
| Google | google.com | Sundar Pichai | 1998 | Mountain View |
| Microsoft | microsoft.com | Satya Nadella | 1975 | Redmond |
| Apple | apple.com | Tim Cook | 1976 | Cupertino |

Function Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| df | pl.DataFrame | required | DataFrame to enrich |
| input_columns | dict[str, str] | required | Mapping of input descriptions to column names |
| output_columns | list[str] | required | List of output column descriptions |
| api_key | str \| None | None | API key (uses PARALLEL_API_KEY env var if not provided) |
| processor | str | "lite-fast" | Parallel processor to use |
| timeout | int | 600 | Timeout in seconds |
| include_basis | bool | False | Include citations in results |

Return Value

The function returns an EnrichmentResult dataclass:
@dataclass
class EnrichmentResult:
    result: pl.DataFrame      # Enriched DataFrame
    success_count: int        # Number of successful rows
    error_count: int          # Number of failed rows
    errors: list[dict]        # Error details with row index
    elapsed_time: float       # Processing time in seconds

Column Name Mapping

Output column descriptions are automatically converted to valid snake_case Python identifiers:

| Description | Column Name |
| --- | --- |
| "CEO name" | ceo_name |
| "Founding year (YYYY)" | founding_year |
| "Annual revenue [USD]" | annual_revenue |
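The mapping above can be sketched as a small helper. This is an illustrative re-implementation of the documented behavior, not the library's actual code:

```python
import re

def to_column_name(description: str) -> str:
    """Convert an output column description to a snake_case identifier
    (illustrative sketch of the documented mapping)."""
    # Drop parenthesized or bracketed qualifiers such as "(YYYY)" or "[USD]"
    cleaned = re.sub(r"[\(\[].*?[\)\]]", "", description)
    # Collapse any run of non-alphanumeric characters into one underscore
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", cleaned.strip())
    return cleaned.strip("_").lower()
```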

LazyFrame Support

Use parallel_enrich_lazy() to work with LazyFrames:
from parallel_web_tools.integrations.polars import parallel_enrich_lazy

# Read from CSV lazily
lf = pl.scan_csv("companies.csv")

# Filter and select
lf = lf.filter(pl.col("active")).select(["name", "website"])

# Enrich (will collect the LazyFrame)
result = parallel_enrich_lazy(
    lf,
    input_columns={"company_name": "name", "website": "website"},
    output_columns=["CEO name"],
)

Including Citations

result = parallel_enrich(
    df,
    input_columns={"company_name": "company"},
    output_columns=["CEO name"],
    include_basis=True,
)

# Access citations in the _basis column
for row in result.result.iter_rows(named=True):
    print(f"CEO: {row['ceo_name']}")
    print(f"Sources: {row['_basis']}")

Processor Selection

Choose a processor based on your speed vs thoroughness requirements. See Choose a Processor for detailed guidance and Pricing for cost information.
| Processor | Speed | Best For |
| --- | --- | --- |
| lite-fast | Fastest | Basic metadata, high volume |
| base-fast | Fast | Standard enrichments |
| core-fast | Medium | Cross-referenced data |
| pro-fast | Slower | Deep research |
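One way to encode the table above is a selection helper. The thresholds here are purely illustrative assumptions; the library leaves the choice to you:

```python
def pick_processor(row_count: int,
                   needs_cross_referencing: bool = False,
                   deep_research: bool = False) -> str:
    """Map speed/thoroughness requirements to a processor name.
    The 500-row threshold is an assumed heuristic, not a library rule."""
    if deep_research:
        return "pro-fast"        # deep research
    if needs_cross_referencing:
        return "core-fast"       # cross-referenced data
    if row_count > 500:
        return "lite-fast"       # favor speed and cost at high volume
    return "base-fast"           # standard enrichments
```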

Best Practices

Be specific in your output column descriptions for better results:
output_columns = [
    "CEO name (current CEO or equivalent leader)",
    "Founding year (YYYY format)",
    "Annual revenue (USD, most recent fiscal year)",
]
Errors don't stop processing; partial results are returned:
result = parallel_enrich(df, ...)

if result.error_count > 0:
    print(f"Failed rows: {result.error_count}")
    for error in result.errors:
        print(f"  Row {error['row']}: {error['error']}")

# Filter successful rows
successful_df = result.result.filter(pl.col("ceo_name").is_not_null())
For very large datasets (1000+ rows), consider processing in batches:
def enrich_in_batches(df: pl.DataFrame, batch_size: int = 100):
    results = []
    for i in range(0, len(df), batch_size):
        batch = df.slice(i, batch_size)
        result = parallel_enrich(batch, ...)
        results.append(result.result)
    return pl.concat(results)
  • Use lite-fast for high-volume, basic enrichments
  • Test with small batches before processing large DataFrames
  • Store results to avoid re-enriching the same data