# Pipelines knowledge base performance tuning

## Background
The performance (i.e., throughput of embeddings per second) can be optimized by changing pipeline and model settings. This guide explains the relevant settings and shows how to tune them.
Knowledge base pipelines process collections of individual records (rows in a table or objects in a volume). Rather than processing each record individually and sequentially, or processing all of them concurrently, AIDB offers batch processing. All the batches get processed sequentially, one after the other. Within each batch, records get processed concurrently wherever possible.
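The batching model described above can be sketched in Python. This is purely illustrative, not AIDB's implementation; `embed` is a hypothetical stand-in for a call to an embedding model:

```python
from concurrent.futures import ThreadPoolExecutor

def embed(record):
    # Hypothetical stand-in for a model call that returns an embedding vector.
    return [0.0] * 4

def process(records, batch_size):
    embeddings = []
    with ThreadPoolExecutor() as pool:
        # Batches run sequentially, one after the other...
        for start in range(0, len(records), batch_size):
            batch = records[start:start + batch_size]
            # ...while records within a batch are embedded concurrently.
            embeddings.extend(pool.map(embed, batch))
    return embeddings

print(len(process(["hello world"] * 10, 4)))  # 10
```

The batch size trades off memory use and request overhead against concurrency, which is why it is the main tuning knob in the steps below.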
- The pipeline `batch_size` setting determines how many records each batch should have.
- Some model providers have configurable internal batch/parallel processing. We recommend leaving these settings at their default values and using the pipeline batch size to control execution.
Note

Vector indexing also has an impact on pipeline performance. You can disable the vector index with `index_type => 'disabled'` to exclude it from your measurements.
## Testing and tuning performance
We will first set up test data and a knowledge base pipeline, then measure and tune the batch size.
### 1) Create a table and insert test data
The length of the data content has some impact on model performance. You can use longer text to test that.
```sql
CREATE TABLE test_data_10k (id INT PRIMARY KEY, msg TEXT NOT NULL);

INSERT INTO test_data_10k (id, msg)
SELECT generate_series(1, 10000) AS id, 'hello world';
```
### 2) Create a knowledge base pipeline
The optimal batch size may be very different for different models. Measure and tune the batch size for each different model you want to use.
```sql
SELECT aidb.create_table_knowledge_base(
    name => 'perf_test',
    model_name => 'dummy',          -- use the model you want to optimize for
    source_table => 'test_data_10k',
    source_data_column => 'msg',
    source_data_format => 'Text',
    index_type => 'disabled',       -- optionally disable vector indexing to include/exclude it from the measurement
    auto_processing => 'Disabled',  -- we want to manually run the pipeline to measure the runtime
    batch_size => 100               -- this is the parameter we will tune during this test
);
```
```
INFO:  using vector table: public.perf_test_vector
NOTICE:  index "vdx_perf_test_vector" does not exist, skipping
NOTICE:  auto-processing is set to "Disabled". Manually run "SELECT aidb.bulk_embedding('perf_test');" to compute embeddings.
 create_table_knowledge_base
-----------------------------
 perf_test
(1 row)
```
### 3) Run the pipeline and measure the performance
We use `psql` in this test; the `\timing on` command is a `psql` feature. If you use a different interface, check how it can display timing information.
```
\timing on
Timing is on.
```
Now run the pipeline:
```sql
SELECT aidb.bulk_embedding('perf_test');
```
```
INFO:  perf_test: (re)setting state table to process all data...
INFO:  perf_test: Starting... Batch size 100, unprocessed rows: 10000, count(source records): 10000, count(embeddings): 0
INFO:  perf_test: Batch iteration finished, unprocessed rows: 9900, count(source records): 10000, count(embeddings): 100
INFO:  perf_test: Batch iteration finished, unprocessed rows: 9800, count(source records): 10000, count(embeddings): 200
...
INFO:  perf_test: Batch iteration finished, unprocessed rows: 0, count(source records): 10000, count(embeddings): 10000
INFO:  perf_test: finished, unprocessed rows: 0, count(source records): 10000, count(embeddings): 10000
 bulk_embedding
----------------

(1 row)

Time: 207161,174 ms (03:27,161)
```
### 4) Tune the batch size
You can use this call to adjust the batch size of the pipeline. We increase it by 10x, to 1000 records:

```sql
SELECT aidb.set_auto_knowledge_base('perf_test', 'Disabled', batch_size=>1000);
```
Run the pipeline again.
Note

When using a Postgres table as the source with auto-processing disabled, AIDB has no means of detecting changes in the source data, so each `bulk_embedding` call has to re-process everything. This is convenient for performance testing.

If you want to measure performance with a volumes source, delete and re-create the knowledge base between tests: AIDB is able to detect changes on volumes even with auto-processing disabled.
```sql
SELECT aidb.bulk_embedding('perf_test');
```
```
INFO:  perf_test: (re)setting state table to process all data...
INFO:  perf_test: Starting... Batch size 1000, unprocessed rows: 10000, count(source records): 10000, count(embeddings): 10000
...
INFO:  perf_test: finished, unprocessed rows: 0, count(source records): 10000, count(embeddings): 10000
 bulk_embedding
----------------

(1 row)

Time: 154276,486 ms (02:34,276)
```
## Conclusion
In this test, the pipeline took 02:34 min with batch size 1000 and 03:27 min with batch size 100. You can continue testing larger batch sizes until performance no longer improves, or even declines.
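As a sanity check, the measured runtimes translate into throughput (embeddings per second) with plain arithmetic, for example:

```python
records = 10_000

# Wall-clock times reported by psql for the two runs above, in seconds.
time_batch_100 = 207.161
time_batch_1000 = 154.276

for batch_size, seconds in [(100, time_batch_100), (1000, time_batch_1000)]:
    print(f"batch_size={batch_size}: {records / seconds:.0f} embeddings/s")
# batch_size=100: 48 embeddings/s
# batch_size=1000: 65 embeddings/s
```

Comparing throughput rather than raw runtimes makes it easier to compare test runs that process different numbers of records.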