Identify the preprocessing term that splits a do…

A company is configuring a system to ingest a long internal manual into an Amazon Bedrock knowledge base and search only the relevant sections when generating answers. Before ingestion, the document must be split into meaningful units suitable for embedding and search. Which term describes this splitting process?

Correct: C

Explanation

Question Overview

Identify the preprocessing term that splits a document prior to knowledge base ingestion.

Requirements to satisfy

1. "search only the relevant sections": the document must be prepared as meaningful units for search.
2. "split into meaningful units": this is chunking, which works at a coarser granularity than tokenization.

Per-option explanation

A. Incorrect

Tokenization

This is incorrect. Tokenization splits text into the smallest units (tokens) that a model processes. Both tokenization and chunking involve splitting, but the granularity differs: word-level or sub-word-level pieces are too fine-grained to serve as search units. Splitting a document into units for search is chunking.
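The difference in granularity can be seen in a minimal sketch. Both helpers here are hypothetical stand-ins: `toy_tokenize` splits on words and punctuation, whereas real model tokenizers typically use sub-word vocabularies, and `toy_chunk` simply splits on blank lines.

```python
import re

text = (
    "Expense reports are due by the fifth business day.\n\n"
    "Attach receipts for every item over the approval limit."
)

def toy_tokenize(s: str) -> list[str]:
    # Stand-in for a real tokenizer (assumption): words and punctuation marks.
    return re.findall(r"\w+|[^\w\s]", s)

def toy_chunk(s: str) -> list[str]:
    # Paragraph-level units: split on blank lines and drop empty pieces.
    return [p.strip() for p in s.split("\n\n") if p.strip()]

tokens = toy_tokenize(text)
chunks = toy_chunk(text)

print(len(tokens), "tokens, e.g.", tokens[:3])
print(len(chunks), "chunks, e.g.", repr(chunks[0]))
```

The same text yields many small tokens but only a couple of chunks, and only the chunks are large enough to carry a searchable meaning on their own.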

B. Incorrect

Fine-tuning

This is incorrect. Fine-tuning additionally trains the weights of a pre-trained foundation model on labeled data to adapt it to a specific task. Further training on company-specific data, such as inquiry responses or domain-specific text, teaches the model a style and specialized knowledge. Unlike RAG, which retrieves external documents and passes them to the model, fine-tuning updates what the model itself contains. The question asks about a preprocessing step for knowledge base ingestion, not model retraining.

C. Correct

Chunking

This is correct. Chunking is the preprocessing step that splits a long document into meaningful units (chunks) suitable for search and embedding. For example, a several-hundred-page internal manual is split at heading and paragraph boundaries into units of a few hundred to a thousand characters, with a unit such as 'Chapter 3: Expense Reimbursement Procedure' treated as one chunk. Knowledge base ingestion proceeds in the order chunking → embedding generation → vector storage, and at query time only the chunks closest to the question are retrieved for use in generating answers.
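As a rough illustration of splitting at paragraph boundaries, here is a minimal sketch. The function `chunk_document`, its `max_chars` parameter, and the sample manual text are all hypothetical; this is not how Amazon Bedrock implements chunking, only one simple way to pack paragraphs into size-limited units.

```python
def chunk_document(text: str, max_chars: int = 1000) -> list[str]:
    """Pack paragraphs (separated by blank lines) into chunks of up to max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the current chunk
        else:
            if current:
                chunks.append(current)
            current = para  # start a new chunk at this paragraph boundary
    if current:
        chunks.append(current)
    return chunks

# Hypothetical sample text; a small limit is used so the split is visible.
manual = (
    "Chapter 3: Expense Reimbursement Procedure\n\n"
    "Submit the form within 30 days.\n\n"
    "Chapter 4: Travel Booking\n\n"
    "Book through the internal portal."
)
chunks = chunk_document(manual, max_chars=80)
for i, chunk in enumerate(chunks, start=1):
    print(f"--- chunk {i} ---\n{chunk}")
```

Because the split happens only at paragraph boundaries, each heading stays together with the text that follows it, which keeps every chunk meaningful on its own.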

D. Incorrect

Embedding generation

This is incorrect. Embedding generation converts already-split units (chunks) into semantic vectors, so it is a downstream step of chunking in the ingestion flow. Each chunk is passed through an embedding model to produce a vector of hundreds to thousands of numbers, which is then stored in a vector store. At query time, the question is vectorized in the same way and the chunks whose vectors are closest in meaning are retrieved. The question asks about the splitting step itself, which comes before embedding generation.
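The downstream steps can be sketched with a toy example. Everything here is an assumption for illustration: `toy_embed` is a word-count vector standing in for a real embedding model (which would return a dense vector of hundreds to thousands of numbers), the "vector store" is a plain list, and cosine similarity is used as the closeness measure.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Stand-in for an embedding model (assumption): a sparse word-count vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity: higher means the two vectors point in a closer direction.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Chunks are assumed to have been produced by an earlier chunking step.
chunks = [
    "Expense reimbursement: submit receipts within 30 days.",
    "Travel booking: reserve flights through the internal portal.",
]
store = [(chunk, toy_embed(chunk)) for chunk in chunks]  # toy "vector store"

question = "How do I submit receipts for reimbursement?"
q_vec = toy_embed(question)  # the question is vectorized the same way
best_chunk, _ = max(store, key=lambda item: cosine(q_vec, item[1]))
print(best_chunk)
```

Note that embedding and search both operate on chunks that already exist, which is why the splitting step has to come first.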

Key Takeaway

Memorize the RAG/knowledge base ingestion order as 'chunking (splitting) → embedding generation (vectorization) → vector store storage.' Tokenization is easily confused with chunking, but it splits text into the minimum units a model reads, which is a much finer granularity. Embedding generation is a downstream step that follows splitting. 'Splitting into units for search' = chunking.
