Preparer ChunkText operation examples

These examples use the ChunkText operation in AI Accelerator, both as a primitive function and with preparers.

Tip

This operation transforms the shape of the data, automatically unnesting collections by introducing a part_id column. See the unnesting concept for more detail.

Primitive

-- Only specify a desired length
SELECT * FROM aidb.chunk_text('This is a simple test sentence.', '{"desired_length": 10}');
Output
 part_id |   chunk
---------+-----------
       0 | This is a
       1 | simple
       2 | test
       3 | sentence.
(4 rows)
-- Specify a desired length and a maximum length
SELECT * FROM aidb.chunk_text('This is a simple test sentence.', '{"desired_length": 10, "max_length": 15}');
Output
 part_id |    chunk
---------+-------------
       0 | This is a
       1 | simple test
       2 | sentence.
(3 rows)
-- Named parameters
SELECT * FROM aidb.chunk_text(
    input => 'This is a significantly longer text example that might require splitting into smaller chunks. The purpose of this function is to partition text data into segments of a specified maximum length, for example, this sentence is 145 characters. This enables processing or storage of data in manageable parts.',
    options => '{"desired_length": 40}'
);
Output
 part_id |                 chunk
---------+----------------------------------------
       0 | This is a significantly longer text
       1 | example that might require splitting
       2 | into smaller chunks.
       3 | The purpose of this function is to
       4 | partition text data into segments of a
       5 | specified maximum length, for example,
       6 | this sentence is 145 characters.
       7 | This enables processing or storage of
       8 | data in manageable parts.
(9 rows)
-- Semantic chunking: split into the largest continuous semantic chunks that fit within max_length
SELECT * FROM aidb.chunk_text('This sentence should be its own chunk. This too.', '{"desired_length": 1, "max_length": 1000}');
Output
 part_id |                 chunk
---------+----------------------------------------
       0 | This sentence should be its own chunk.
       1 | This too.
(2 rows)
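
Because aidb.chunk_text is a set-returning function, it can also be applied to table data directly with a LATERAL join, without creating a preparer. The following is a minimal sketch; the documents table and its id and body columns are hypothetical stand-ins for your own schema:

```sql
-- Hypothetical "documents" table with columns (id, body)
-- Each source row expands into one output row per chunk
SELECT d.id, c.part_id, c.chunk
FROM documents d,
     LATERAL aidb.chunk_text(d.body, '{"desired_length": 40}') AS c;
```

For repeated or incremental chunking of table data, a preparer (shown in the next section) manages the destination table for you.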

Preparer with table data source

-- Create source test table
CREATE TABLE source_table__1628
(
    id      INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    content TEXT NOT NULL
);
INSERT INTO source_table__1628
VALUES (1, 'This is a significantly longer text example that might require splitting into smaller chunks. The purpose of this function is to partition text data into segments of a specified maximum length, for example, this sentence is 145 characters. This enables processing or storage of data in manageable parts.'),
       (2, 'This sentence should be its own chunk. This too.');

SELECT aidb.create_table_preparer(
    name => 'preparer__1628',
    operation => 'ChunkText',
    source_table => 'source_table__1628',
    source_data_column => 'content',
    destination_table => 'chunked_data__1628',
    destination_data_column => 'chunks',
    source_key_column => 'id',
    destination_key_column => 'id',
    options => '{"desired_length": 1, "max_length": 1000}'::JSONB  -- Configuration for the ChunkText operation
);

SELECT aidb.bulk_data_preparation('preparer__1628');

SELECT * FROM chunked_data__1628;
Output
 id | part_id | unique_id |                                                                      chunks
----+---------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------
 1  |       0 | 1.part.0  | This is a significantly longer text example that might require splitting into smaller chunks.
 1  |       1 | 1.part.1  | The purpose of this function is to partition text data into segments of a specified maximum length, for example, this sentence is 145 characters.
 1  |       2 | 1.part.2  | This enables processing or storage of data in manageable parts.
 2  |       0 | 2.part.0  | This sentence should be its own chunk.
 2  |       1 | 2.part.1  | This too.
(5 rows)
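
Because the preparer copies the source key into the destination key column, the destination table can be joined back to the source, for example to view each chunk alongside its original text. A simple sketch using the tables created above:

```sql
-- Pair each chunk with the source row it came from, via the shared id key
SELECT s.id, c.part_id, c.chunks, s.content AS original
FROM source_table__1628 s
JOIN chunked_data__1628 c ON c.id = s.id
ORDER BY s.id, c.part_id;
```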
