Components#

Embedding#

Class for embedding.

This module provides:

— Embedding

class grag.components.embedding.Embedding(embedding_type: str, embedding_model: str)[source]#

Bases: object

A class for vector embeddings.

Supports:

— Hugging Face sentence transformers -> embedding_type = ‘sentence-transformers’

— Hugging Face instructor embeddings -> embedding_type = ‘instructor-embedding’

embedding_type[source]#: embedding model type, refer above for supported types

embedding_model[source]#: huggingface model name

embedding_function[source]#: langchain embedding type

LLM#

Class for LLM.

grag.components.llm.LLM(model_name: str, quantization: str, pipeline: str, device_map: str = 'auto', task: str = 'text-generation', max_new_tokens: str = '1024', temperature: str | float = 0.1, n_batch: str | int = 1024, n_ctx: str | int = 6000, n_gpu_layers: str | int = -1, std_out: bool | str = True, base_dir: str | Path = PosixPath('models'), callbacks=None)[source]#

A class for managing and utilizing large language models (LLMs).

This class facilitates the loading and operation of large language models using different pipelines and settings. It supports both local and Hugging Face-based model management, with adjustable parameters for quantization, computational specifics, and output control.

grag.components.llm.model_name[source]#

Name of the model to be loaded.

Type:: str

grag.components.llm.quantization[source]#

Quantization setting for the model, affecting performance and memory usage.

Type:: str

grag.components.llm.pipeline[source]#

Type of pipeline (‘llama_cpp’ or ‘hf’) used for model operations.

Type:: str

grag.components.llm.device_map[source]#

Device mapping for model execution, defaults to ‘auto’.

Type:: str

grag.components.llm.task[source]#

The task for which the model is being used, defaults to ‘text-generation’.

Type:: str

grag.components.llm.max_new_tokens[source]#

Maximum number of new tokens to be generated, defaults to 1024.

Type:: int

grag.components.llm.temperature[source]#

Sampling temperature for generation, affecting randomness.

Type:: float

grag.components.llm.n_batch[source]#

Batch size for llama.cpp, i.e. the number of prompt tokens processed in parallel.

Type:: int

grag.components.llm.n_ctx[source]#

Context window size, in tokens, for llama.cpp.

Type:: int

grag.components.llm.n_gpu_layers[source]#

Number of model layers to offload to the GPU for llama.cpp; the default of -1 offloads all layers.

Type:: int

grag.components.llm.std_out[source]#

Flag or descriptor for standard output during operations.

Type:: bool or str

grag.components.llm.base_dir[source]#

Base directory path for model files, defaults to ‘models’.

Type:: str or Path

grag.components.llm.callbacks[source]#

List of callback functions for additional processing.

Type:: list or None

Retriever#

Class for retriever.

This module provides:

— Retriever

grag.components.multivec_retriever.Retriever(vectordb: VectorDB | None = None, store_path: str | Path = PosixPath('data/doc_store'), top_k: str | int = 3, id_key: str = 'doc_id', namespace: str = '71e4b558187b270922923569301f1039', client_kwargs: Dict[str, Any] | None = None)[source]#

A class for multi vector retriever.

It connects to a vector database and a local file store. It returns the most similar chunks from the vector store, and can additionally return the document, chunk, etc. linked to each result.

grag.components.multivec_retriever.vectordb[source]#: ChromaClient class instance from components.client (Optional; if the user provides it, store_path, id_key and namespace are ignored)

grag.components.multivec_retriever.store_path[source]#: Path to the local file store

grag.components.multivec_retriever.id_key[source]#: A key prefix for identifying documents

grag.components.multivec_retriever.store[source]#: langchain.storage.LocalFileStore object, stores the key value pairs of document id and parent file

grag.components.multivec_retriever.retriever[source]#: langchain.retrievers.multi_vector.MultiVectorRetriever class instance, langchain’s multi-vector retriever

grag.components.multivec_retriever.splitter[source]#: TextSplitter class instance from components.text_splitter

grag.components.multivec_retriever.namespace[source]#: Namespace for producing unique id

grag.components.multivec_retriever.top_k[source]#: Number of top chunks to return from similarity search.
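
The link between chunks and their parent document can be pictured with the standard library alone. This is a minimal sketch and not the library's implementation: it assumes the `namespace` value is used as a UUID namespace for deterministic ids (via `uuid.uuid5`), and the helper names `make_doc_id` and `index_document`, the plain-dict document store, and the fixed-width chunking are all hypothetical.

```python
import uuid
from typing import Dict, List, Tuple

# Default namespace from the signature above, parsed as a UUID (assumption).
NAMESPACE = uuid.UUID("71e4b558187b270922923569301f1039")


def make_doc_id(source: str, namespace: uuid.UUID = NAMESPACE) -> str:
    # uuid5 is deterministic: the same source always yields the same id.
    return uuid.uuid5(namespace, source).hex


def index_document(
    source: str,
    text: str,
    chunk_size: int,
    doc_store: Dict[str, str],
    id_key: str = "doc_id",
) -> List[Tuple[str, Dict[str, str]]]:
    """Store the parent text and return (chunk, metadata) pairs linked to it."""
    doc_id = make_doc_id(source)
    doc_store[doc_id] = text  # parent document, keyed by its id
    # Hypothetical fixed-width chunking, standing in for a real text splitter.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Each chunk carries the parent id under id_key, so a hit can be traced back.
    return [(chunk, {id_key: doc_id}) for chunk in chunks]


store: Dict[str, str] = {}
pairs = index_document("a.pdf", "abcdefghij", 4, store)
parent_text = store[pairs[0][1]["doc_id"]]  # look up the parent from a chunk
```

Because the id is derived from the source, re-indexing the same document overwrites its entry in the store rather than creating a duplicate.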

Parse PDF#

Classes for parsing files.

This module provides:

— ParsePDF

grag.components.parse_pdf.ParsePDF(single_text_out: bool = True, strategy: str = 'hi_res', infer_table_structure: bool = True, extract_images: bool = True, image_output_dir: str | None = None, add_captions_to_text: bool = True, add_captions_to_blocks: bool = True, add_caption_first: bool = True, table_as_html: bool = False)[source]#

Parsing and partitioning PDF documents into Text, Table or Image elements.

grag.components.parse_pdf.single_text_out[source]#

Whether to combine all text elements into a single output document.

Type:: bool

grag.components.parse_pdf.strategy[source]#

The strategy for PDF partitioning; default is “hi_res” for better accuracy.

Type:: str

grag.components.parse_pdf.infer_table_structure[source]#

Whether to extract tables during partitioning.

Type:: bool

grag.components.parse_pdf.extract_images[source]#

Whether to extract images.

Type:: bool

grag.components.parse_pdf.image_output_dir[source]#

Directory to save extracted images, if any.

Type:: str or None

grag.components.parse_pdf.add_captions_to_text[source]#

Whether to include figure captions in text output. Default is True.

Type:: bool

grag.components.parse_pdf.add_captions_to_blocks[source]#

Whether to add captions to table and image blocks. Default is True.

Type:: bool

grag.components.parse_pdf.add_caption_first[source]#

Whether to place captions before their corresponding image or table in the output. Default is True.

Type:: bool

grag.components.parse_pdf.table_as_html[source]#

Whether to add table elements as HTML. Default is False.

Type:: bool

Prompt#

Classes for prompts.

This module provides:

— Prompt: for generic prompts

— FewShotPrompt: for few-shot prompts

class grag.components.prompt.FewShotPrompt(*, name: str = 'custom_prompt', llm_type: str = 'None', task: str = 'QA', source: str = 'NoSource', doc_chain: str = 'stuff', language: str = 'en', filepath: str | None = None, input_keys: List[str], template: str, prompt: PromptTemplate | None = None, output_keys: List[str], examples: List[Dict[str, Any]], prefix: str, suffix: str, example_template: str)[source]#

Bases: Prompt

A class for few-shot prompts.

name[source]#

The prompt name (Optional, defaults to “custom_prompt”) (Parent Class)

Type:: str

llm_type[source]#

The type of LLM, e.g. llama2 (Optional, defaults to “None”) (Parent Class)

Type:: str

task[source]#

The task (Optional, defaults to QA) (Parent Class)

Type:: str

source[source]#

The source of the prompt (Optional, defaults to “NoSource”) (Parent Class)

Type:: str

doc_chain[source]#

The doc chain for the prompt (“stuff”, “refine”) (Optional, defaults to “stuff”) (Parent Class)

Type:: str

language[source]#

The language of the prompt (Optional, defaults to “en”) (Parent Class)

Type:: str

filepath[source]#

The filepath of the prompt (Optional) (Parent Class)

Type:: str or None

input_keys[source]#

The input keys for the prompt (Parent Class)

Type:: List[str]

output_keys[source]#

The output keys for the prompt

Type:: List[str]

prefix[source]#

The template prefix for the prompt

Type:: str

suffix[source]#

The template suffix for the prompt

Type:: str

example_template[source]#

The template for formatting the examples

Type:: str

examples[source]#

The list of examples, each example is a dictionary with respective keys

Type:: List[Dict[str, Any]]

example_template: str[source]#

examples: List[Dict[str, Any]][source]#

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}[source]#: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}[source]#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'doc_chain': FieldInfo(annotation=str, required=False, default='stuff'), 'example_template': FieldInfo(annotation=str, required=True), 'examples': FieldInfo(annotation=List[Dict[str, Any]], required=True), 'filepath': FieldInfo(annotation=Union[str, NoneType], required=False, default=None, exclude=True), 'input_keys': FieldInfo(annotation=List[str], required=True), 'language': FieldInfo(annotation=str, required=False, default='en'), 'llm_type': FieldInfo(annotation=str, required=False, default='None'), 'name': FieldInfo(annotation=str, required=False, default='custom_prompt'), 'output_keys': FieldInfo(annotation=List[str], required=True), 'prefix': FieldInfo(annotation=str, required=True), 'prompt': FieldInfo(annotation=Union[PromptTemplate, NoneType], required=False, default=None, exclude=True, repr=False), 'source': FieldInfo(annotation=str, required=False, default='NoSource'), 'suffix': FieldInfo(annotation=str, required=True), 'task': FieldInfo(annotation=str, required=False, default='QA'), 'template': FieldInfo(annotation=str, required=True)}[source]#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

output_keys: List[str][source]#

prefix: str[source]#

suffix: str[source]#

classmethod validate_examples(v) → List[Dict[str, Any]][source]#: Validate the examples field.

classmethod validate_output_keys(v) → List[str][source]#: Validate the output_keys field.
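
How `prefix`, `examples`, `example_template` and `suffix` might combine into a single template can be sketched as follows. This is an illustration only: the function name `build_few_shot_template` is hypothetical, and both the `str.format`-style placeholders and the blank-line separator between parts are assumptions rather than documented behaviour.

```python
from typing import Any, Dict, List


def build_few_shot_template(
    prefix: str,
    suffix: str,
    example_template: str,
    examples: List[Dict[str, Any]],
) -> str:
    """Assemble prefix, formatted examples, and suffix into one prompt string."""
    # Render each example dictionary through the example template.
    rendered = [example_template.format(**example) for example in examples]
    # Assumption: the parts are separated by blank lines.
    return "\n\n".join([prefix, *rendered, suffix])


template = build_few_shot_template(
    prefix="Answer briefly.",
    suffix="Q: {question}\nA:",
    example_template="Q: {question}\nA: {answer}",
    examples=[{"question": "1+1?", "answer": "2"}],
)
```

The suffix is left unformatted here, so its `{question}` placeholder remains available as an input key for the final prompt.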

class grag.components.prompt.Prompt(*, name: str = 'custom_prompt', llm_type: str = 'None', task: str = 'QA', source: str = 'NoSource', doc_chain: str = 'stuff', language: str = 'en', filepath: str | None = None, input_keys: List[str], template: str, prompt: PromptTemplate | None = None)[source]#

Bases: BaseModel

A class for generic prompts.

name[source]#

The prompt name (Optional, defaults to “custom_prompt”)

Type:: str

llm_type[source]#

The type of LLM, e.g. llama2 (Optional, defaults to “None”)

Type:: str

task[source]#

The task (Optional, defaults to QA)

Type:: str

source[source]#

The source of the prompt (Optional, defaults to “NoSource”)

Type:: str

doc_chain[source]#

The doc chain for the prompt (“stuff”, “refine”) (Optional, defaults to “stuff”)

Type:: str

language[source]#

The language of the prompt (Optional, defaults to “en”)

Type:: str

filepath[source]#

The filepath of the prompt (Optional)

Type:: str or None

input_keys[source]#

The input keys for the prompt

Type:: List[str]

template[source]#

The template for the prompt

Type:: str

doc_chain: str[source]#

filepath: str | None[source]#

format(**kwargs) → str[source]#: Formats the prompt with provided keys and returns a string.

input_keys: List[str][source]#

language: str[source]#

llm_type: str[source]#

classmethod load(filepath: Path | str)[source]#: Loads a JSON file and returns a Prompt instance.

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}[source]#: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}[source]#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'doc_chain': FieldInfo(annotation=str, required=False, default='stuff'), 'filepath': FieldInfo(annotation=Union[str, NoneType], required=False, default=None, exclude=True), 'input_keys': FieldInfo(annotation=List[str], required=True), 'language': FieldInfo(annotation=str, required=False, default='en'), 'llm_type': FieldInfo(annotation=str, required=False, default='None'), 'name': FieldInfo(annotation=str, required=False, default='custom_prompt'), 'prompt': FieldInfo(annotation=Union[PromptTemplate, NoneType], required=False, default=None, exclude=True, repr=False), 'source': FieldInfo(annotation=str, required=False, default='NoSource'), 'task': FieldInfo(annotation=str, required=False, default='QA'), 'template': FieldInfo(annotation=str, required=True)}[source]#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

name: str[source]#

prompt: PromptTemplate | None[source]#

save(filepath: Path | str | None, overwrite=False) → None | ValueError[source]#: Saves the prompt to a JSON file.

source: str[source]#

task: str[source]#

template: str[source]#

classmethod validate_doc_chain(v: str) → str[source]#: Validate the doc_chain field.

classmethod validate_input_keys(v) → List[str][source]#: Validate the input_keys field.

classmethod validate_task(v: str) → str[source]#: Validate the task field.
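
The format, save and load round trip can be sketched with a standard-library stand-in. `PromptSketch` below is hypothetical: it is a dataclass rather than the Pydantic model documented above, it models only a few fields, it assumes `str.format`-style placeholders, and it assumes `save` raises `ValueError` when the file exists and `overwrite` is False.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Union


@dataclass
class PromptSketch:
    """Stand-in for the Pydantic-based Prompt; only a few fields are modelled."""

    input_keys: List[str]
    template: str
    name: str = "custom_prompt"
    task: str = "QA"

    def format(self, **kwargs) -> str:
        # Fill the template placeholders with the provided keys.
        return self.template.format(**kwargs)

    def save(self, filepath: Union[Path, str], overwrite: bool = False) -> None:
        path = Path(filepath)
        # Assumption: refusing to overwrite is signalled by raising ValueError.
        if path.exists() and not overwrite:
            raise ValueError(f"{path} already exists; pass overwrite=True")
        path.write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def load(cls, filepath: Union[Path, str]) -> "PromptSketch":
        # Rebuild the prompt from the JSON fields written by save().
        return cls(**json.loads(Path(filepath).read_text()))


prompt = PromptSketch(
    input_keys=["context", "question"],
    template="Context: {context}\nQuestion: {question}",
)
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "prompt.json"
    prompt.save(target)
    restored = PromptSketch.load(target)
```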

Text Splitter#

Class for splitting/chunking text.

This module provides:

— TextSplitter

grag.components.text_splitter.TextSplitter(chunk_size: int | str = 2000, chunk_overlap: int | str = 400)[source]#

Class for recursively chunking text; it splits on ‘\n\n’ first, then ‘\n’, and so on.

grag.components.text_splitter.chunk_size[source]#: maximum size of chunk

grag.components.text_splitter.chunk_overlap[source]#: chunk overlap size
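
The separator priority can be illustrated with a small recursive splitter. This is a simplified sketch, not the class's actual implementation: the function name `recursive_split` and the separator list are assumptions, sizes are measured in characters, and chunk overlap is omitted to keep the example short.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " "),
) -> List[str]:
    """Split text into pieces of at most chunk_size characters.

    The first separator is tried first; any piece that is still too long
    is split again with the next, finer separator.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


parts = recursive_split("aaaa\n\nbbbb\n\ncc dd ee", chunk_size=8)
```

Paragraph breaks are preferred as split points, so a chunk is only cut inside a paragraph when the paragraph itself exceeds `chunk_size`.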

Utils#

Utils functions.

This module provides:

— stuff_docs: concatenates langchain documents into a string

— load_prompt: loads json prompt to langchain prompt

— find_config_path: finds the path of the ‘config.ini’ file by traversing up the directory tree from the current path.

— get_config: retrieves and parses the configuration settings from the ‘config.ini’ file.

— configure_args: a decorator to configure class instantiation arguments from a ‘config.ini’ file.

grag.components.utils.configure_args(cls)[source]#

Decorator to configure class instantiation arguments from a ‘config.ini’ file, based on the class’s module name.

This function reads configuration specific to a class’s module from ‘config.ini’, then uses it to override or provide defaults for keyword arguments passed during class instantiation.

Parameters:: cls (class) – The class whose instantiation is to be configured.
Returns:: A wrapped class constructor that uses modified arguments based on the configuration.
Return type:: function
Raises:: TypeError – If there is a mismatch in provided arguments and class constructor requirements.
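
A decorator of this kind can be sketched with `configparser` and `functools`. The sketch below is hypothetical in several respects: it reads an inline string instead of a ‘config.ini’ file on disk, it assumes the section name is the last component of the class's module name, and it assumes explicitly passed keyword arguments take precedence over configured values.

```python
import configparser
import functools

# Hypothetical config contents; the real decorator reads 'config.ini' from disk.
_CONFIG_TEXT = """
[text_splitter]
chunk_size = 5000
chunk_overlap = 400
"""


def configure_args_sketch(cls):
    """Fill missing constructor kwargs from the config section named after the module."""
    parser = configparser.ConfigParser()
    parser.read_string(_CONFIG_TEXT)
    # Assumption: the section name is the last component of the module name.
    section = cls.__module__.split(".")[-1]
    original_init = cls.__init__

    @functools.wraps(original_init)
    def new_init(self, *args, **kwargs):
        defaults = dict(parser[section]) if parser.has_section(section) else {}
        # Assumption: explicitly passed kwargs win over config values.
        merged = {**defaults, **kwargs}
        original_init(self, *args, **merged)

    cls.__init__ = new_init
    return cls


@configure_args_sketch
class SplitterConfig:
    # Pretend this class lives in a module named 'text_splitter'.
    __module__ = "grag.components.text_splitter"

    def __init__(self, chunk_size=2000, chunk_overlap=400):
        # configparser returns strings, so coerce to int.
        self.chunk_size = int(chunk_size)
        self.chunk_overlap = int(chunk_overlap)


from_config = SplitterConfig()                # chunk_size taken from the config
overridden = SplitterConfig(chunk_size=100)   # explicit kwarg wins
```

Values read by `configparser` are always strings, which is one plausible reason the constructors documented on this page accept `str | int` for numeric parameters. In this sketch, passing a configured argument positionally raises `TypeError`, because the same argument is then also supplied as a keyword.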

grag.components.utils.find_config_path(current_path: Path)[source]#

Finds the path of the ‘config.ini’ file by traversing up the directory tree from the current path.

This function starts at the current path and moves up the directory tree until it finds a file named ‘config.ini’. If ‘config.ini’ is not found by the time the root of the directory tree is reached, None is returned.

Parameters:: current_path (Path) – The starting point for the search, typically the location of the script being executed.
Returns:: None or the path to the found ‘config.ini’ file.
Return type:: Path
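
The upward search described above can be sketched with `pathlib`. This is an illustrative stand-in, not the library's code: the name `find_config_path_sketch` and the extra `filename` parameter are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_config_path_sketch(
    current_path: Path, filename: str = "config.ini"
) -> Optional[Path]:
    """Walk from current_path up to the filesystem root looking for filename."""
    current = current_path.resolve()
    # Start from the containing directory if a file path was given.
    if current.is_file():
        current = current.parent
    # `parents` yields every ancestor, ending at the root.
    for directory in (current, *current.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None  # reached the root without finding the file


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "config.ini").write_text("[llm]\n")
    nested = root / "a" / "b"
    nested.mkdir(parents=True)
    found = find_config_path_sketch(nested)
    found_matches_root = found == root / "config.ini"
```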

grag.components.utils.get_config(load_env=False)[source]#

Retrieves and parses the configuration settings from the ‘config.ini’ file.

This function locates the ‘config.ini’ file by calling find_config_path using the script’s current location. It initializes a ConfigParser object to read the configuration settings from the located ‘config.ini’ file. Optionally, it can also load environment variables from a .env file specified in the config. If a config file cannot be read, a default dictionary is returned.

Parameters:

load_env (bool) – If True, load environment variables from the path specified in the ‘config.ini’. Defaults to False.

Returns:

A parser object containing the configuration settings from ‘config.ini’, or a defaultdict if the file is not found or cannot be read.

Return type:

ConfigParser
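
The parse-or-fall-back behaviour can be sketched as below. This is a simplified stand-in with a hypothetical name: it takes the path explicitly instead of locating it, it does not load any .env file, and it assumes the fallback is a defaultdict whose missing sections resolve to empty dicts.

```python
import configparser
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Optional, Union


def get_config_sketch(
    config_path: Optional[Path],
) -> Union[configparser.ConfigParser, defaultdict]:
    """Parse config_path, or fall back to an empty nested defaultdict."""
    if config_path is None or not Path(config_path).is_file():
        # Assumption: missing sections resolve to empty dicts.
        return defaultdict(dict)
    parser = configparser.ConfigParser()
    parser.read(config_path)
    return parser


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.ini"
    path.write_text("[llm]\nn_ctx = 6000\n")
    n_ctx = get_config_sketch(path)["llm"]["n_ctx"]  # values are strings

fallback = get_config_sketch(None)
```

Both return types support `config[section]` lookups, so callers can index the result the same way whether or not a file was found.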

grag.components.utils.stuff_docs(docs: List[Document]) → str[source]#

Concatenates langchain documents into a string using a ‘\n\n’ separator.

Parameters:: docs – List of langchain_core.documents.Document
Returns:: string of document page content joined by ‘\n\n’
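
The concatenation amounts to a single join. In this minimal sketch, `Doc` is a hypothetical stand-in for `langchain_core.documents.Document` that models only the `page_content` attribute.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (page_content only)."""

    page_content: str


def stuff_docs_sketch(docs: List[Doc]) -> str:
    # Join each document's text with a blank line between documents.
    return "\n\n".join(doc.page_content for doc in docs)


combined = stuff_docs_sketch([Doc("first chunk"), Doc("second chunk")])
```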
