Commit Graph

2 Commits

Author SHA1 Message Date
Greyson LaLonde
e29ca9ec28 feat: replace embedchain with native crewai adapter (#451)
- Remove embedchain adapter; add crewai rag adapter and update all search tools  
- Add loaders: pdf, youtube (video & channel), github, docs site, mysql, postgresql  
- Add configurable similarity threshold, limit params, and embedding_model support  
- Improve chromadb compatibility (sanitize metadata, convert columns, fix chunking)  
- Fix xml encoding, Python 3.10 issues, and youtube url spoofing  
- Update crewai dependency and instructions; refresh uv.lock  
- Update tests for new rag adapter and search params
2025-09-18 19:02:22 -04:00
Lucas Gomide
dc039cfac8 Adds RAG feature (#406)
* feat: initialize rag

* refactor: using cosine distance metric for chromadb

* feat: use RecursiveCharacterTextSplitter as chunker strategy

* feat: support chucker and loader per data_type

* feat: adding JSON loader

* feat: adding CSVLoader

* feat: adding loader for DOCX files

* feat: add loader for MDX files

* feat: add loader for XML files

* feat: add loader for parser Webpage

* feat: support to load files from an entire directory

* feat: support to auto-load the loaders for additional DataType

* feat: add chuckers for some specific data type

- Each chunker uses separators specific to its content type

* feat: prevent document duplication and centralize content management

- Implement document deduplication logic in RAG
  * Check for existing documents by source reference
  * Compare doc IDs to detect content changes
  * Automatically replace outdated content while preventing duplicates

- Centralize common functionality for better maintainability
  * Create SourceContent class to handle URLs, files, and text uniformly
  * Extract shared utilities (compute_sha256) to misc.py
  * Standardize doc ID generation across all loaders

- Improve RAG system architecture
  * All loaders now inherit consistent patterns from centralized BaseLoader
  * Better separation of concerns with dedicated content management classes
  * Standardized LoaderResult structure across all loader implementations

* chore: split text loaders file

* test: adding missing tests about RAG loaders

* refactor: QOL

* fix: add missing uv syntax on DOCXLoader
2025-08-19 18:30:35 -04:00