# How to compare values with a CSV file using WOQL
This guide shows you how to compare a list of values against a CSV file using WOQL. This is useful for synchronization tasks where you need to determine what has been added, what has been deleted, and what remains unchanged between two data sources.
## The Problem
When synchronizing data, you often need to compare:
- A list of values you have (e.g., current employee IDs)
- A CSV file with reference data (e.g., an HR export)
You want to categorize each value as:
- `to_add`: Values in your list that are not in the CSV
- `to_delete`: Values in the CSV that are not in your list
- `no_change`: Values that exist in both
## The Solution: Using Set Operations
The most efficient approach uses WOQL's native set operations (`set_difference`, `set_intersection`, `set_union`), which run in O(n log n) time and can handle 100,000+ elements in under a second.
See also: Integration tests for these patterns
### Example CSV File

Create a file called `employee-list.csv`:

```csv
EmployeeId
E001
E002
E003
E005
E007
```

### Complete WOQL Query with Set Operations

```javascript
WOQL.and(
  // Step 1: Define your source list using group_by
  WOQL.group_by(
    [],
    ["source_val"],
    "v:source_list",
    WOQL.or(
      WOQL.eq("v:source_val", WOQL.literal("E001", "xsd:string")),
      WOQL.eq("v:source_val", WOQL.literal("E002", "xsd:string")),
      WOQL.eq("v:source_val", WOQL.literal("E004", "xsd:string")),
      WOQL.eq("v:source_val", WOQL.literal("E006", "xsd:string"))
    )
  ),
  // Step 2: Read CSV once into a list variable
  WOQL.group_by(
    [],
    ["csv_val"],
    "v:csv_list",
    WOQL.get(WOQL.as("EmployeeId", "v:csv_val"))
      .post("employee-list.csv", { type: "csv" })
  ),
  // Step 3: Convert lists to sets (sorts and removes duplicates)
  WOQL.list_to_set("v:source_list", "v:source_set"),
  WOQL.list_to_set("v:csv_list", "v:csv_set"),
  // Step 4: Compute set differences and intersection
  WOQL.set_difference("v:source_set", "v:csv_set", "v:to_add"),     // In source but not CSV
  WOQL.set_difference("v:csv_set", "v:source_set", "v:to_delete"),  // In CSV but not source
  WOQL.set_intersection("v:source_set", "v:csv_set", "v:no_change") // In both
)
```

### Expected Results
With the source list `[E001, E002, E004, E006]` and CSV containing `[E001, E002, E003, E005, E007]`:

```json
{
  "to_add": ["E004", "E006"],
  "to_delete": ["E003", "E005", "E007"],
  "no_change": ["E001", "E002"]
}
```

### How Set Operations Work
| Operation | Description | Complexity |
|---|---|---|
| `list_to_set` | Sorts list and removes duplicates | O(n log n) |
| `set_difference(A, B, Result)` | Elements in A but not in B | O(n log n) |
| `set_intersection(A, B, Result)` | Elements in both A and B | O(n log n) |
| `set_union(A, B, Result)` | All unique elements from A and B | O(n log n) |
| `set_member(Element, Set)` | Check if element is in set | O(log n) |
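Only `set_union` and `set_member` are not used in the query above. A minimal sketch of both, assuming `v:source_set` and `v:csv_set` have been built with `list_to_set` as in Step 3, and that `set_member` accepts a literal element:

```javascript
WOQL.and(
  // All unique employee IDs across both sources
  WOQL.set_union("v:source_set", "v:csv_set", "v:all_ids"),
  // Succeeds only if "E001" is present in the union - O(log n)
  WOQL.set_member(WOQL.literal("E001", "xsd:string"), "v:all_ids")
)
```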
### Performance Benchmarks (set operation only)
| Elements | set_difference | set_intersection |
|---|---|---|
| 1,000 | 7ms | 5ms |
| 5,000 | 30ms | 17ms |
| 10,000 | 64ms | 66ms |
| 100,000 | 577ms | 445ms |
## Comparing Database Content with CSV (Large Datasets)
When comparing large databases against CSV files, the key is to stream through one dataset while using efficient lookups against the other. Never load both datasets entirely into memory.
### Strategy: Stream the Larger Dataset, Index the Smaller
| Scenario | Stream through | Index/Set |
|---|---|---|
| Large DB, small CSV | Database triples | CSV as set |
| Small DB, large CSV | CSV rows | Database index |
### Finding Database Records Not in CSV

When you have a large database and a smaller CSV file, load the CSV into a set and stream through database triples:

```javascript
WOQL.and(
  // Step 1: Load CSV into a set (small, fits in memory)
  WOQL.group_by(
    [],
    ["csv_val"],
    "v:csv_list",
    WOQL.get(WOQL.as("EmployeeId", "v:csv_val"))
      .post("employee-list.csv", { type: "csv" })
  ),
  WOQL.list_to_set("v:csv_list", "v:csv_set"),
  // Step 2: Stream through database records via backtracking
  WOQL.triple("v:Employee", "rdf:type", "@schema:Employee"),
  WOQL.triple("v:Employee", "@schema:employee_id", "v:db_id"),
  // Step 3: For each DB record, check if NOT in CSV set - O(log n)
  WOQL.not(WOQL.set_member("v:db_id", "v:csv_set"))
)
```

This streams through database triples one at a time via backtracking. Each `set_member` check is O(log n), making this efficient even with millions of database records.
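If you only need the missing IDs in the result bindings, you can project the query down to a single variable with `WOQL.select` (a sketch; the inner query is unchanged from above):

```javascript
// Return only v:db_id in the bindings, not every intermediate variable
WOQL.select("v:db_id",
  WOQL.and(
    WOQL.group_by([], ["csv_val"], "v:csv_list",
      WOQL.get(WOQL.as("EmployeeId", "v:csv_val"))
        .post("employee-list.csv", { type: "csv" })
    ),
    WOQL.list_to_set("v:csv_list", "v:csv_set"),
    WOQL.triple("v:Employee", "rdf:type", "@schema:Employee"),
    WOQL.triple("v:Employee", "@schema:employee_id", "v:db_id"),
    WOQL.not(WOQL.set_member("v:db_id", "v:csv_set"))
  )
)
```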
### Finding CSV Values Not in Database

When checking CSV values against a large database, stream through CSV rows and use indexed triple lookups:

```javascript
WOQL.and(
  // Step 1: Stream CSV rows via backtracking
  WOQL.get(WOQL.as("EmployeeId", "v:csv_id"))
    .post("employee-list.csv", { type: "csv" }),
  // Step 2: For each CSV row, check against database index
  WOQL.not(
    WOQL.triple("v:AnyEmployee", "@schema:employee_id", "v:csv_id")
  )
)
```

The triple lookup uses database indexes, so each check is O(log n). This processes one CSV row at a time without loading the entire database.
### Complete Streaming Comparison

To get both directions (what's in the DB but not the CSV, and what's in the CSV but not the DB) efficiently:

```javascript
// Query 1: Database records not in CSV
WOQL.and(
  WOQL.group_by([], ["csv_val"], "v:csv_list",
    WOQL.get(WOQL.as("EmployeeId", "v:csv_val"))
      .post("employee-list.csv", { type: "csv" })
  ),
  WOQL.list_to_set("v:csv_list", "v:csv_set"),
  WOQL.triple("v:Employee", "rdf:type", "@schema:Employee"),
  WOQL.triple("v:Employee", "@schema:employee_id", "v:db_id"),
  WOQL.not(WOQL.set_member("v:db_id", "v:csv_set")),
  WOQL.eq("v:status", WOQL.literal("to_remove", "xsd:string"))
)

// Query 2: CSV values not in database
WOQL.and(
  WOQL.get(WOQL.as("EmployeeId", "v:csv_id"))
    .post("employee-list.csv", { type: "csv" }),
  WOQL.not(WOQL.triple("v:AnyEmployee", "@schema:employee_id", "v:csv_id")),
  WOQL.eq("v:status", WOQL.literal("to_add", "xsd:string"))
)
```

For very large datasets, running these as two separate queries is more memory-efficient than trying to compute both in a single query.
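A sketch of driving the two queries from the JavaScript client (see the client setup later in this guide; `dbNotInCsvQuery` and `csvNotInDbQuery` stand for Query 1 and Query 2 above, and the exact shape of each binding value may vary by client version):

```javascript
// Run the two streaming queries separately and merge the results.
async function streamingComparison(client, csvContent) {
  const files = { 'employee-list.csv': csvContent }

  // Query 1: database records not in the CSV
  const removeResult = await client.query(dbNotInCsvQuery, null, null, { files })
  // Query 2: CSV values not in the database
  const addResult = await client.query(csvNotInDbQuery, null, null, { files })

  // Binding keys follow the variable names; values may be typed literal objects
  return {
    to_remove: removeResult.bindings.map((b) => b['db_id']),
    to_add: addResult.bindings.map((b) => b['csv_id']),
  }
}
```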
## Alternative Approach: Using Member and Not

For smaller datasets, or when you want each value categorized individually in the results, use the member-based approach:

### Complete WOQL Query

```javascript
WOQL.and(
  // Step 1: Define your source list using group_by
  WOQL.group_by(
    [],
    ["source_val"],
    "v:source_list",
    WOQL.or(
      WOQL.eq("v:source_val", WOQL.literal("E001", "xsd:string")),
      WOQL.eq("v:source_val", WOQL.literal("E002", "xsd:string")),
      WOQL.eq("v:source_val", WOQL.literal("E004", "xsd:string")),
      WOQL.eq("v:source_val", WOQL.literal("E006", "xsd:string"))
    )
  ),
  // Step 2: Read CSV once into a list variable
  WOQL.group_by(
    [],
    ["csv_val"],
    "v:csv_list",
    WOQL.get(WOQL.as("EmployeeId", "v:csv_val"))
      .post("employee-list.csv", { type: "csv" })
  ),
  // Step 3: Get all unique values from both sources
  WOQL.distinct(["v:value"],
    WOQL.or(
      WOQL.member("v:value", "v:source_list"),
      WOQL.member("v:value", "v:csv_list")
    )
  ),
  // Step 4: Categorize each value
  WOQL.or(
    // to_add: in source_list but NOT in csv_list
    WOQL.and(
      WOQL.member("v:value", "v:source_list"),
      WOQL.not(WOQL.member("v:value", "v:csv_list")),
      WOQL.eq("v:category", WOQL.literal("to_add", "xsd:string"))
    ),
    // to_delete: in csv_list but NOT in source_list
    WOQL.and(
      WOQL.member("v:value", "v:csv_list"),
      WOQL.not(WOQL.member("v:value", "v:source_list")),
      WOQL.eq("v:category", WOQL.literal("to_delete", "xsd:string"))
    ),
    // no_change: in BOTH lists
    WOQL.and(
      WOQL.member("v:value", "v:source_list"),
      WOQL.member("v:value", "v:csv_list"),
      WOQL.eq("v:category", WOQL.literal("no_change", "xsd:string"))
    )
  )
)
```

### Expected Results
With the source list `[E001, E002, E004, E006]` and CSV containing `[E001, E002, E003, E005, E007]`:

```json
[
  { "value": "E004", "category": "to_add" },
  { "value": "E006", "category": "to_add" },
  { "value": "E003", "category": "to_delete" },
  { "value": "E005", "category": "to_delete" },
  { "value": "E007", "category": "to_delete" },
  { "value": "E001", "category": "no_change" },
  { "value": "E002", "category": "no_change" }
]
```

## Using Set Operations with the JavaScript Client
```javascript
const TerminusClient = require('@terminusdb/terminusdb-client')
// WOQL is exported by the client package
const WOQL = TerminusClient.WOQL

const client = new TerminusClient.WOQLClient('http://localhost:6363', {
  user: 'admin',
  key: 'root',
})

async function compareWithCSV() {
  await client.connect()
  await client.db('mydb')

  // CSV content is supplied inline and posted alongside the query
  const csvContent = `EmployeeId
E001
E002
E003
E005
E007`

  const query = WOQL.and(
    WOQL.group_by(
      [],
      ["source_val"],
      "v:source_list",
      WOQL.or(
        WOQL.eq("v:source_val", WOQL.literal("E001", "xsd:string")),
        WOQL.eq("v:source_val", WOQL.literal("E002", "xsd:string")),
        WOQL.eq("v:source_val", WOQL.literal("E004", "xsd:string")),
        WOQL.eq("v:source_val", WOQL.literal("E006", "xsd:string"))
      )
    ),
    WOQL.group_by(
      [],
      ["csv_val"],
      "v:csv_list",
      WOQL.get(WOQL.as("EmployeeId", "v:csv_val"))
        .post("employee-list.csv", { type: "csv" })
    ),
    WOQL.list_to_set("v:source_list", "v:source_set"),
    WOQL.list_to_set("v:csv_list", "v:csv_set"),
    WOQL.set_difference("v:source_set", "v:csv_set", "v:to_add"),
    WOQL.set_difference("v:csv_set", "v:source_set", "v:to_delete"),
    WOQL.set_intersection("v:source_set", "v:csv_set", "v:no_change")
  )

  // The files map makes the CSV available to .post() under that name
  const result = await client.query(query, null, null, {
    files: { 'employee-list.csv': csvContent }
  })
  console.log(result.bindings)
}

compareWithCSV().catch(console.error)
```

## Comparing Two CSV Files
Compare two CSV files using set operations:
```javascript
WOQL.and(
  // Read first CSV into set A
  WOQL.group_by(
    [],
    ["val_a"],
    "v:list_a",
    WOQL.get(WOQL.as("Id", "v:val_a"))
      .post("file-a.csv", { type: "csv" })
  ),
  WOQL.list_to_set("v:list_a", "v:set_a"),
  // Read second CSV into set B
  WOQL.group_by(
    [],
    ["val_b"],
    "v:list_b",
    WOQL.get(WOQL.as("Id", "v:val_b"))
      .post("file-b.csv", { type: "csv" })
  ),
  WOQL.list_to_set("v:list_b", "v:set_b"),
  // Compute differences and intersection
  WOQL.set_difference("v:set_a", "v:set_b", "v:only_in_a"),
  WOQL.set_difference("v:set_b", "v:set_a", "v:only_in_b"),
  WOQL.set_intersection("v:set_a", "v:set_b", "v:in_both")
)
```
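To supply both files from the JavaScript client, pass them together in the files map, following the same pattern as the client example above (`twoCsvQuery`, `contentA`, and `contentB` are illustrative names):

```javascript
// Sketch: posting two CSV files alongside a single query.
const result = await client.query(twoCsvQuery, null, null, {
  files: {
    'file-a.csv': contentA, // must match .post("file-a.csv", ...)
    'file-b.csv': contentB, // must match .post("file-b.csv", ...)
  },
})
```

## Performance Considerations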
### Set Operations (Recommended)
The set operations provide excellent performance for large datasets:
| Dataset Size | Performance | Recommendation |
|---|---|---|
| Up to 10,000 | < 100ms | Use set operations directly |
| 10,000-100,000 | 100-600ms | Use set operations directly |
| 100,000+ | ~1 second | Consider chunked processing |
### Member-Based Approach
The member-based approach has O(n²) complexity and works well for smaller datasets:
| Dataset Size | Performance | Recommendation |
|---|---|---|
| Up to 1,000 | Fast | Works well |
| 1,000-5,000 | Several seconds | Consider set operations |
| 5,000+ | May timeout | Use set operations instead |
### Very Large Datasets (Millions of Records)
For very large databases or CSV files:
- **Use streaming patterns** - Process CSV rows one at a time with backtracking instead of collecting into lists
- **Use indexed triple lookups** - Let the database use indexes for individual lookups
- **Collect CSV into a set, iterate database** - Keep the smaller dataset in memory as a set, iterate the larger one
- **Consider chunked processing** - Process data in batches of 50,000-100,000 records (see the sketch below)
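Chunked processing can be driven from the client: split the source list, run the set-operation query once per chunk, and merge the results. A sketch (`buildComparisonQuery` is a hypothetical helper that builds the set-operation query from earlier for a given chunk of source IDs):

```javascript
// Split an array into chunks of the given size.
function chunkArray(items, size) {
  const chunks = []
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size))
  }
  return chunks
}

// Compare a very large source list against the CSV, 50,000 IDs at a time.
// Note: per-chunk "to_add" results can simply be concatenated, but a value
// is only truly "to_delete" if it appears in no chunk, so those results
// must be intersected across chunks.
async function compareInChunks(client, allIds, csvContent) {
  const results = []
  for (const chunk of chunkArray(allIds, 50000)) {
    const query = buildComparisonQuery(chunk) // hypothetical helper
    results.push(await client.query(query, null, null, {
      files: { 'employee-list.csv': csvContent },
    }))
  }
  return results
}
```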
## Summary
The recommended techniques for CSV comparison in WOQL are:
- Use `group_by` to read CSV data once into a list variable
- Use `list_to_set` to convert lists to sorted sets
- Use `set_difference` and `set_intersection` for O(n log n) comparison
- Use `set_member` for O(log n) membership checks against large sets
For comparing database content with CSV:
- Use triple patterns to query database values efficiently
- Use backtracking for memory-efficient processing of large datasets
- Collect the smaller dataset into a set, iterate the larger one
The set operations can handle 100,000+ elements in under a second, making them suitable for production synchronization tasks.