# How to compare values with a CSV file using WOQL
This guide shows you how to compare a list of values against a CSV file using WOQL. This is useful for synchronization tasks where you need to determine what has been added, what has been deleted, and what remains unchanged between two data sources.
## The Problem
When synchronizing data, you often need to compare:
- A list of values you have (e.g., current employee IDs)
- A CSV file with reference data (e.g., an HR export)
You want to categorize each value as:
- `to_add`: Values in your list that are not in the CSV
- `to_delete`: Values in the CSV that are not in your list
- `no_change`: Values that exist in both
## The Solution: Using Set Operations
The most efficient approach uses WOQL's native set operations (`set_difference`, `set_intersection`, `set_union`), which run in O(n log n) time and can handle 100,000+ elements in under a second.
See also: Integration tests for these patterns
### Example CSV File

Create a file called `employee-list.csv`:

```csv
EmployeeId
E001
E002
E003
E005
E007
```

### Complete WOQL Query with Set Operations

```javascript
WOQL.and(
  // Step 1: Define your source list using group_by
  WOQL.group_by(
    [],
    ["source_val"],
    "v:source_list",
    WOQL.or(
      WOQL.eq("v:source_val", WOQL.literal("E001", "xsd:string")),
      WOQL.eq("v:source_val", WOQL.literal("E002", "xsd:string")),
      WOQL.eq("v:source_val", WOQL.literal("E004", "xsd:string")),
      WOQL.eq("v:source_val", WOQL.literal("E006", "xsd:string"))
    )
  ),
  // Step 2: Read CSV once into a list variable
  WOQL.group_by(
    [],
    ["csv_val"],
    "v:csv_list",
    WOQL.get(WOQL.as("EmployeeId", "v:csv_val"))
      .post("employee-list.csv", { type: "csv" })
  ),
  // Step 3: Convert lists to sets (sorts and removes duplicates)
  WOQL.list_to_set("v:source_list", "v:source_set"),
  WOQL.list_to_set("v:csv_list", "v:csv_set"),
  // Step 4: Compute set differences and intersection
  WOQL.set_difference("v:source_set", "v:csv_set", "v:to_add"),     // In source but not CSV
  WOQL.set_difference("v:csv_set", "v:source_set", "v:to_delete"),  // In CSV but not source
  WOQL.set_intersection("v:source_set", "v:csv_set", "v:no_change") // In both
)
```

### Expected Results
With the source list `[E001, E002, E004, E006]` and CSV containing `[E001, E002, E003, E005, E007]`:

```json
{
  "to_add": ["E004", "E006"],
  "to_delete": ["E003", "E005", "E007"],
  "no_change": ["E001", "E002"]
}
```

### How Set Operations Work
| Operation | Description | Complexity |
|---|---|---|
| `list_to_set` | Sorts list and removes duplicates | O(n log n) |
| `set_difference(A, B, Result)` | Elements in A but not in B | O(n log n) |
| `set_intersection(A, B, Result)` | Elements in both A and B | O(n log n) |
| `set_union(A, B, Result)` | All unique elements from A and B | O(n log n) |
| `set_member(Element, Set)` | Check if element is in set | O(log n) |
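Only `set_union` and `set_member` are not used in the query above. A minimal sketch of both, assuming `v:source_set` and `v:csv_set` have been built with `list_to_set` as in Step 3, and that `set_member` accepts a literal element:

```javascript
WOQL.and(
  // All unique employee IDs across both sources
  WOQL.set_union("v:source_set", "v:csv_set", "v:all_ids"),
  // Succeeds only if "E001" is present in the union - O(log n)
  WOQL.set_member(WOQL.literal("E001", "xsd:string"), "v:all_ids")
)
```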
### Performance Benchmarks (set operation only)
| Elements | set_difference | set_intersection |
|---|---|---|
| 1,000 | 7ms | 5ms |
| 5,000 | 30ms | 17ms |
| 10,000 | 64ms | 66ms |
| 100,000 | 577ms | 445ms |
## Comparing Database Content with CSV (Large Datasets)
When comparing large databases against CSV files, the key is to stream through one dataset while using efficient lookups against the other. Never load both datasets entirely into memory.
### Strategy: Stream the Larger Dataset, Index the Smaller
| Scenario | Stream through | Index/Set |
|---|---|---|
| Large DB, small CSV | Database triples | CSV as set |
| Small DB, large CSV | CSV rows | Database index |
### Finding Database Records Not in CSV

When you have a large database and a smaller CSV file, load the CSV into a set and stream through database triples:

```javascript
WOQL.and(
  // Step 1: Load CSV into a set (small, fits in memory)
  WOQL.group_by(
    [],
    ["csv_val"],
    "v:csv_list",
    WOQL.get(WOQL.as("EmployeeId", "v:csv_val"))
      .post("employee-list.csv", { type: "csv" })
  ),
  WOQL.list_to_set("v:csv_list", "v:csv_set"),
  // Step 2: Stream through database records via backtracking
  WOQL.triple("v:Employee", "rdf:type", "@schema:Employee"),
  WOQL.triple("v:Employee", "@schema:employee_id", "v:db_id"),
  // Step 3: For each DB record, check if NOT in CSV set - O(log n)
  WOQL.not(WOQL.set_member("v:db_id", "v:csv_set"))
)
```

This streams through database triples one at a time via backtracking. Each `set_member` check is O(log n), making this efficient even with millions of database records.
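If you only need the missing IDs in the result bindings, you can project the query down to a single variable with `WOQL.select` (a sketch; the inner query is unchanged from above):

```javascript
// Return only v:db_id in the bindings, not every intermediate variable
WOQL.select("v:db_id",
  WOQL.and(
    WOQL.group_by([], ["csv_val"], "v:csv_list",
      WOQL.get(WOQL.as("EmployeeId", "v:csv_val"))
        .post("employee-list.csv", { type: "csv" })
    ),
    WOQL.list_to_set("v:csv_list", "v:csv_set"),
    WOQL.triple("v:Employee", "rdf:type", "@schema:Employee"),
    WOQL.triple("v:Employee", "@schema:employee_id", "v:db_id"),
    WOQL.not(WOQL.set_member("v:db_id", "v:csv_set"))
  )
)
```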
### Finding CSV Values Not in Database

When checking CSV values against a large database, stream through CSV rows and use indexed triple lookups:

```javascript
WOQL.and(
  // Step 1: Stream CSV rows via backtracking
  WOQL.get(WOQL.as("EmployeeId", "v:csv_id"))
    .post("employee-list.csv", { type: "csv" }),
  // Step 2: For each CSV row, check against database index
  WOQL.not(
    WOQL.triple("v:AnyEmployee", "@schema:employee_id", "v:csv_id")
  )
)
```

The triple lookup uses database indexes, so each check is O(log n). This processes one CSV row at a time without loading the entire database.
### Complete Streaming Comparison

To get both directions (what's in the DB but not the CSV, and what's in the CSV but not the DB) efficiently:

```javascript
// Query 1: Database records not in CSV
WOQL.and(
  WOQL.group_by([], ["csv_val"], "v:csv_list",
    WOQL.get(WOQL.as("EmployeeId", "v:csv_val"))
      .post("employee-list.csv", { type: "csv" })
  ),
  WOQL.list_to_set("v:csv_list", "v:csv_set"),
  WOQL.triple("v:Employee", "rdf:type", "@schema:Employee"),
  WOQL.triple("v:Employee", "@schema:employee_id", "v:db_id"),
  WOQL.not(WOQL.set_member("v:db_id", "v:csv_set")),
  WOQL.eq("v:status", WOQL.literal("to_remove", "xsd:string"))
)

// Query 2: CSV values not in database
WOQL.and(
  WOQL.get(WOQL.as("EmployeeId", "v:csv_id"))
    .post("employee-list.csv", { type: "csv" }),
  WOQL.not(WOQL.triple("v:AnyEmployee", "@schema:employee_id", "v:csv_id")),
  WOQL.eq("v:status", WOQL.literal("to_add", "xsd:string"))
)
```

For very large datasets, running these as two separate queries is more memory-efficient than trying to compute both in a single query.
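A sketch of driving the two queries from the JavaScript client (see the client setup later in this guide; `dbNotInCsvQuery` and `csvNotInDbQuery` stand for Query 1 and Query 2 above, and the exact shape of each binding value may vary by client version):

```javascript
// Run the two streaming queries separately and merge the results.
async function streamingComparison(client, csvContent) {
  const files = { 'employee-list.csv': csvContent }

  // Query 1: database records not in the CSV
  const removeResult = await client.query(dbNotInCsvQuery, null, null, { files })
  // Query 2: CSV values not in the database
  const addResult = await client.query(csvNotInDbQuery, null, null, { files })

  // Binding keys follow the variable names; values may be typed literal objects
  return {
    to_remove: removeResult.bindings.map((b) => b['db_id']),
    to_add: addResult.bindings.map((b) => b['csv_id']),
  }
}
```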
## Alternative Approach: Using Member and Not

For smaller datasets, or when you want each value categorized individually in the results, use the member-based approach:

### Complete WOQL Query

```javascript
WOQL.and(
  // Step 1: Define your source list using group_by
  WOQL.group_by(
    [],
    ["source_val"],
    "v:source_list",
    WOQL.or(
      WOQL.eq("v:source_val", WOQL.literal("E001", "xsd:string")),
      WOQL.eq("v:source_val", WOQL.literal("E002", "xsd:string")),
      WOQL.eq("v:source_val", WOQL.literal("E004", "xsd:string")),
      WOQL.eq("v:source_val", WOQL.literal("E006", "xsd:string"))
    )
  ),
  // Step 2: Read CSV once into a list variable
  WOQL.group_by(
    [],
    ["csv_val"],
    "v:csv_list",
    WOQL.get(WOQL.as("EmployeeId", "v:csv_val"))
      .post("employee-list.csv", { type: "csv" })
  ),
  // Step 3: Get all unique values from both sources
  WOQL.distinct(["v:value"],
    WOQL.or(
      WOQL.member("v:value", "v:source_list"),
      WOQL.member("v:value", "v:csv_list")
    )
  ),
  // Step 4: Categorize each value
  WOQL.or(
    // to_add: in source_list but NOT in csv_list
    WOQL.and(
      WOQL.member("v:value", "v:source_list"),
      WOQL.not(WOQL.member("v:value", "v:csv_list")),
      WOQL.eq("v:category", WOQL.literal("to_add", "xsd:string"))
    ),
    // to_delete: in csv_list but NOT in source_list
    WOQL.and(
      WOQL.member("v:value", "v:csv_list"),
      WOQL.not(WOQL.member("v:value", "v:source_list")),
      WOQL.eq("v:category", WOQL.literal("to_delete", "xsd:string"))
    ),
    // no_change: in BOTH lists
    WOQL.and(
      WOQL.member("v:value", "v:source_list"),
      WOQL.member("v:value", "v:csv_list"),
      WOQL.eq("v:category", WOQL.literal("no_change", "xsd:string"))
    )
  )
)
```

### Expected Results
With the source list `[E001, E002, E004, E006]` and CSV containing `[E001, E002, E003, E005, E007]`:

```json
[
  { "value": "E004", "category": "to_add" },
  { "value": "E006", "category": "to_add" },
  { "value": "E003", "category": "to_delete" },
  { "value": "E005", "category": "to_delete" },
  { "value": "E007", "category": "to_delete" },
  { "value": "E001", "category": "no_change" },
  { "value": "E002", "category": "no_change" }
]
```

## Using Set Operations with the JavaScript Client
```javascript
const TerminusClient = require('@terminusdb/terminusdb-client')
// WOQL is exported by the client package
const WOQL = TerminusClient.WOQL

const client = new TerminusClient.WOQLClient('http://localhost:6363', {
  user: 'admin',
  key: 'root',
})

async function compareWithCSV() {
  await client.connect()
  await client.db('mydb')

  // CSV content is supplied inline and posted alongside the query
  const csvContent = `EmployeeId
E001
E002
E003
E005
E007`

  const query = WOQL.and(
    WOQL.group_by(
      [],
      ["source_val"],
      "v:source_list",
      WOQL.or(
        WOQL.eq("v:source_val", WOQL.literal("E001", "xsd:string")),
        WOQL.eq("v:source_val", WOQL.literal("E002", "xsd:string")),
        WOQL.eq("v:source_val", WOQL.literal("E004", "xsd:string")),
        WOQL.eq("v:source_val", WOQL.literal("E006", "xsd:string"))
      )
    ),
    WOQL.group_by(
      [],
      ["csv_val"],
      "v:csv_list",
      WOQL.get(WOQL.as("EmployeeId", "v:csv_val"))
        .post("employee-list.csv", { type: "csv" })
    ),
    WOQL.list_to_set("v:source_list", "v:source_set"),
    WOQL.list_to_set("v:csv_list", "v:csv_set"),
    WOQL.set_difference("v:source_set", "v:csv_set", "v:to_add"),
    WOQL.set_difference("v:csv_set", "v:source_set", "v:to_delete"),
    WOQL.set_intersection("v:source_set", "v:csv_set", "v:no_change")
  )

  // The files map makes the CSV available to .post() under that name
  const result = await client.query(query, null, null, {
    files: { 'employee-list.csv': csvContent }
  })
  console.log(result.bindings)
}

compareWithCSV().catch(console.error)
```

## Comparing Two CSV Files
Compare two CSV files using set operations:
```javascript
WOQL.and(
  // Read first CSV into set A
  WOQL.group_by(
    [],
    ["val_a"],
    "v:list_a",
    WOQL.get(WOQL.as("Id", "v:val_a"))
      .post("file-a.csv", { type: "csv" })
  ),
  WOQL.list_to_set("v:list_a", "v:set_a"),
  // Read second CSV into set B
  WOQL.group_by(
    [],
    ["val_b"],
    "v:list_b",
    WOQL.get(WOQL.as("Id", "v:val_b"))
      .post("file-b.csv", { type: "csv" })
  ),
  WOQL.list_to_set("v:list_b", "v:set_b"),
  // Compute differences and intersection
  WOQL.set_difference("v:set_a", "v:set_b", "v:only_in_a"),
  WOQL.set_difference("v:set_b", "v:set_a", "v:only_in_b"),
  WOQL.set_intersection("v:set_a", "v:set_b", "v:in_both")
)
```
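To supply both files from the JavaScript client, pass them together in the files map, following the same pattern as the client example above (`twoCsvQuery`, `contentA`, and `contentB` are illustrative names):

```javascript
// Sketch: posting two CSV files alongside a single query.
const result = await client.query(twoCsvQuery, null, null, {
  files: {
    'file-a.csv': contentA, // must match .post("file-a.csv", ...)
    'file-b.csv': contentB, // must match .post("file-b.csv", ...)
  },
})
```

## Performance Considerations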
### Set Operations (Recommended)
The set operations provide excellent performance for large datasets:
| Dataset Size | Performance | Recommendation |
|---|---|---|
| Up to 10,000 | < 100ms | Use set operations directly |
| 10,000-100,000 | 100-600ms | Use set operations directly |
| 100,000+ | ~1 second | Consider chunked processing |
### Member-Based Approach
The member-based approach has O(n²) complexity and works well for smaller datasets:
| Dataset Size | Performance | Recommendation |
|---|---|---|
| Up to 1,000 | Fast | Works well |
| 1,000-5,000 | Several seconds | Consider set operations |
| 5,000+ | May timeout | Use set operations instead |
### Very Large Datasets (Millions of Records)
For very large databases or CSV files:
- **Use streaming patterns** - Process CSV rows one at a time with backtracking instead of collecting into lists
- **Use indexed triple lookups** - Let the database use indexes for individual lookups
- **Collect CSV into a set, iterate database** - Keep the smaller dataset in memory as a set, iterate the larger one
- **Consider chunked processing** - Process data in batches of 50,000-100,000 records (see the sketch below)
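Chunked processing can be driven from the client: split the source list, run the set-operation query once per chunk, and merge the results. A sketch (`buildComparisonQuery` is a hypothetical helper that builds the set-operation query from earlier for a given chunk of source IDs):

```javascript
// Split an array into chunks of the given size.
function chunkArray(items, size) {
  const chunks = []
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size))
  }
  return chunks
}

// Compare a very large source list against the CSV, 50,000 IDs at a time.
// Note: per-chunk "to_add" results can simply be concatenated, but a value
// is only truly "to_delete" if it appears in no chunk, so those results
// must be intersected across chunks.
async function compareInChunks(client, allIds, csvContent) {
  const results = []
  for (const chunk of chunkArray(allIds, 50000)) {
    const query = buildComparisonQuery(chunk) // hypothetical helper
    results.push(await client.query(query, null, null, {
      files: { 'employee-list.csv': csvContent },
    }))
  }
  return results
}
```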
## Summary
The recommended techniques for CSV comparison in WOQL are:
- Use `group_by` to read CSV data once into a list variable
- Use `list_to_set` to convert lists to sorted sets
- Use `set_difference` and `set_intersection` for O(n log n) comparison
- Use `set_member` for O(log n) membership checks against large sets
For comparing database content with CSV:
- Use triple patterns to query database values efficiently
- Use backtracking for memory-efficient processing of large datasets
- Collect the smaller dataset into a set, iterate the larger one
The set operations can handle 100,000+ elements in under a second, making them suitable for production synchronization tasks.