crewAI

andre/crewAI

mirror of https://github.com/crewAIInc/crewAI.git synced 2026-01-10 16:48:30 +00:00

Commit Graph

Author: Lucas Gomide

SHA1: dc039cfac8

Message:

Adds RAG feature (#406)

* feat: initialize rag

* refactor: using cosine distance metric for chromadb

* feat: use RecursiveCharacterTextSplitter as chunker strategy

* feat: support chunker and loader per data_type

* feat: adding JSON loader

* feat: adding CSVLoader

* feat: adding loader for DOCX files

* feat: add loader for MDX files

* feat: add loader for XML files

* feat: add loader for parsing webpages

* feat: support loading files from an entire directory

* feat: support auto-loading the loaders for additional DataType values

* feat: add chunkers for some specific data types

- Each chunker uses separators specific to its content type

* feat: prevent document duplication and centralize content management

- Implement document deduplication logic in RAG
  * Check for existing documents by source reference
  * Compare doc IDs to detect content changes
  * Automatically replace outdated content while preventing duplicates

- Centralize common functionality for better maintainability
  * Create SourceContent class to handle URLs, files, and text uniformly
  * Extract shared utilities (compute_sha256) to misc.py
  * Standardize doc ID generation across all loaders

- Improve RAG system architecture
  * All loaders now inherit consistent patterns from centralized BaseLoader
  * Better separation of concerns with dedicated content management classes
  * Standardized LoaderResult structure across all loader implementations

* chore: split text loaders file

* test: adding missing tests for RAG loaders

* refactor: QOL

* fix: add missing uv syntax on DOCXLoader

Date: 2025-08-19 18:30:35 -04:00
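
The deduplication behaviour described in the commit message (look up existing documents by source reference, compare doc IDs to detect content changes, replace outdated content) can be illustrated with a small sketch. This is not the code from the commit: `InMemoryStore`, its `upsert` method, and the way the doc ID is derived from the source reference plus content are hypothetical, and only the name `compute_sha256` comes from the message (its real signature is an assumption here).

```python
import hashlib
from dataclasses import dataclass, field


def compute_sha256(content: str) -> str:
    """Return the hex SHA-256 digest of a string (assumed signature)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a vector store, keyed by source reference."""

    # Maps source_ref -> (doc_id, content)
    docs: dict[str, tuple[str, str]] = field(default_factory=dict)

    def upsert(self, source_ref: str, content: str) -> str:
        # Assumption: the doc ID is a hash over the source reference and content,
        # so any change to the content yields a different ID.
        doc_id = compute_sha256(source_ref + content)
        existing = self.docs.get(source_ref)

        # Same source and same ID: the content is unchanged, so skip it.
        if existing is not None and existing[0] == doc_id:
            return "skipped"

        # Same source but a different ID: replace the outdated content.
        status = "replaced" if existing is not None else "added"
        self.docs[source_ref] = (doc_id, content)
        return status


store = InMemoryStore()
print(store.upsert("notes.txt", "hello"))     # first time this source is seen
print(store.upsert("notes.txt", "hello"))     # identical content
print(store.upsert("notes.txt", "hello v2"))  # changed content for the same source
```

Keying the store by source reference keeps at most one entry per source, while the hash comparison avoids rewriting content that has not changed.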

1 Commit
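
The commit message also mentions chunkers that use separators specific to their content type. A rough, dependency-free sketch of that idea follows. The separator lists, the `SEPARATORS` table, and the `chunk` helper are illustrative assumptions, not the values or API used in the commit. Unlike `RecursiveCharacterTextSplitter`, this simplified version drops the separators and does not merge small pieces back together.

```python
# Hypothetical separator lists per content type, ordered from coarse to fine.
SEPARATORS = {
    "markdown": ["\n## ", "\n\n", "\n", " "],
    "csv": ["\n", ","],
    "text": ["\n\n", "\n", " "],
}


def split_recursive(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split text on the coarsest separator first, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, rest = separators[0], separators[1:]
    chunks: list[str] = []
    for piece in text.split(first):
        chunks.extend(split_recursive(piece, rest, chunk_size))
    return chunks


def chunk(text: str, data_type: str, chunk_size: int = 40) -> list[str]:
    """Pick the separator list for the data type, defaulting to plain text."""
    separators = SEPARATORS.get(data_type, SEPARATORS["text"])
    return split_recursive(text, separators, chunk_size)


print(chunk("a,b\nc,d", "csv", chunk_size=3))
```

Ordering each list from coarse to fine means a chunk is only broken at a finer boundary when the coarser one still leaves it over the size limit.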