Compare commits

...

4 Commits

Author SHA1 Message Date
Tony Kipkemboi
0695c26703 Merge branch 'main' into brandon/cre-510-update-docs-to-talk-about-pydantic-and-json-outputs 2024-12-04 11:05:47 -05:00
Brandon Hancock
4fb3331c6a Talk about getting structured consistent outputs with tasks. 2024-12-04 10:46:39 -05:00
Stephen
b6c6eea6f5 Update README.md (#1694)
Corrected the statement saying users cannot disable telemetry; users can now disable it by setting the environment variable OTEL_SDK_DISABLED to true.

Co-authored-by: Brandon Hancock (bhancock_ai) <109994880+bhancockio@users.noreply.github.com>
2024-12-03 16:08:19 -05:00
Lorenze Jay
1af95f5146 Knowledge project directory standard (#1691)
* Knowledge project directory standard

* fixed types

* comment fix

* made base file knowledge source an abstract class

* cleaner validator on model_post_init

* fix type checker

* cleaner refactor

* better template
2024-12-03 12:27:48 -08:00
13 changed files with 286 additions and 52 deletions

View File

@@ -376,7 +376,7 @@ pip install dist/*.tar.gz
CrewAI uses anonymous telemetry to collect usage data with the main purpose of helping us improve the library by focusing our efforts on the most used features, integrations and tools.
It's pivotal to understand that **NO data is collected** concerning prompts, task descriptions, agents' backstories or goals, usage of tools, API calls, responses, any data processed by the agents, or secrets and environment variables, with the exception of the conditions mentioned. When the `share_crew` feature is enabled, detailed data including task descriptions, agents' backstories or goals, and other specific attributes are collected to provide deeper insights while respecting user privacy. We don't offer a way to disable it now, but we will in the future.
It's pivotal to understand that **NO data is collected** concerning prompts, task descriptions, agents' backstories or goals, usage of tools, API calls, responses, any data processed by the agents, or secrets and environment variables, with the exception of the conditions mentioned. When the `share_crew` feature is enabled, detailed data including task descriptions, agents' backstories or goals, and other specific attributes are collected to provide deeper insights while respecting user privacy. Users can disable telemetry by setting the environment variable OTEL_SDK_DISABLED to true.
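For illustration, the new guidance can be applied by setting the variable before CrewAI is imported (a minimal sketch; `OTEL_SDK_DISABLED` is the standard OpenTelemetry SDK switch):
```python
import os

# Disable the OpenTelemetry SDK (and with it CrewAI's anonymous telemetry)
# before crewai is imported, per the updated README above.
os.environ["OTEL_SDK_DISABLED"] = "true"

from crewai import Agent, Crew, Task  # telemetry is now disabled
```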
Data collected includes:

View File

@@ -156,14 +156,35 @@ crew = Crew(
    agents=[agent],
    tasks=[task],
    knowledge_sources=[source],
    embedder_config={
        "model": "BAAI/bge-small-en-v1.5",
        "normalize": True,
        "max_length": 512
    embedder={
        "provider": "ollama",
        "config": {"model": "nomic-embed-text:latest"},
    }
)
```
### Referencing Sources
You can reference knowledge sources by their collection name or metadata.
* Add a directory called `knowledge` to your crew project.
* File paths can then be referenced relative to the `knowledge` directory.

Example: a file inside the `knowledge` directory called `example.txt` can be referenced simply as `example.txt`.
```python
source = TextFileKnowledgeSource(
    file_path="example.txt",  # or /example.txt
    collection_name="example"
)
crew = Crew(
    agents=[agent],
    tasks=[task],
    knowledge_sources=[source],
)
```
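Sources can carry metadata as well, mirroring the commented template further down in this diff (a sketch; the `preference` key is purely illustrative):
```python
source = TextFileKnowledgeSource(
    file_path="example.txt",
    metadata={"preference": "personal"},  # illustrative metadata for the source
)
```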
## Best Practices
<AccordionGroup>

View File

@@ -263,6 +263,167 @@ analysis_task = Task(
)
```
## Getting Structured, Consistent Outputs from Tasks
When you need a task to produce output in a structured, consistent format, you can use the `output_pydantic` or `output_json` property on the task. These properties let you define the expected output structure, making it easier to parse and use the results in your application.
<Note>
Note that the output of a crew's final task becomes the final output of the crew itself.
</Note>
### Using `output_pydantic`
The `output_pydantic` property allows you to define a Pydantic model that the task output should conform to. This ensures that the output is not only structured but also validated according to the Pydantic model.
Here's an example demonstrating how to use `output_pydantic`:
```python Code
import json

from crewai import Agent, Crew, Process, Task
from pydantic import BaseModel


class Blog(BaseModel):
    title: str
    content: str


blog_agent = Agent(
    role="Blog Content Generator Agent",
    goal="Generate a blog title and content",
    backstory="""You are an expert content creator, skilled in crafting engaging and informative blog posts.""",
    verbose=False,
    allow_delegation=False,
    llm="gpt-4o",
)

task1 = Task(
    description="""Create a blog title and content on a given topic. Make sure the content is under 200 words.""",
    expected_output="A compelling blog title and well-written content.",
    agent=blog_agent,
    output_pydantic=Blog,
)

# Instantiate your crew with a sequential process
crew = Crew(
    agents=[blog_agent],
    tasks=[task1],
    verbose=True,
    process=Process.sequential,
)

result = crew.kickoff()

# Option 1: Accessing Properties Using Dictionary-Style Indexing
print("Accessing Properties - Option 1")
title = result["title"]
content = result["content"]
print("Title:", title)
print("Content:", content)

# Option 2: Accessing Properties Directly from the Pydantic Model
print("Accessing Properties - Option 2")
title = result.pydantic.title
content = result.pydantic.content
print("Title:", title)
print("Content:", content)

# Option 3: Accessing Properties Using the to_dict() Method
print("Accessing Properties - Option 3")
output_dict = result.to_dict()
title = output_dict["title"]
content = output_dict["content"]
print("Title:", title)
print("Content:", content)

# Option 4: Printing the Entire Blog Object
print("Accessing Properties - Option 4")
print("Blog:", result)
```
In this example:
* A Pydantic model `Blog` is defined with `title` and `content` fields.
* The task `task1` uses the `output_pydantic` property to specify that its output should conform to the `Blog` model.
* After executing the crew, you can access the structured output in multiple ways, as shown.
#### Explanation of Accessing the Output
1. Dictionary-Style Indexing: You can directly access the fields using `result["field_name"]`. This works because the `CrewOutput` class implements the `__getitem__` method.
2. Directly from the Pydantic Model: Access the attributes directly from the `result.pydantic` object.
3. Using the `to_dict()` Method: Convert the output to a dictionary and access the fields.
4. Printing the Entire Object: Simply print the `result` object to see the structured output.
### Using `output_json`
The `output_json` property allows you to define the expected output in JSON format. This ensures that the task's output is a valid JSON structure that can be easily parsed and used in your application.
Here's an example demonstrating how to use `output_json`:
```python Code
import json

from crewai import Agent, Crew, Process, Task
from pydantic import BaseModel


# Define the Pydantic model for the blog
class Blog(BaseModel):
    title: str
    content: str


# Define the agent
blog_agent = Agent(
    role="Blog Content Generator Agent",
    goal="Generate a blog title and content",
    backstory="""You are an expert content creator, skilled in crafting engaging and informative blog posts.""",
    verbose=False,
    allow_delegation=False,
    llm="gpt-4o",
)

# Define the task with output_json set to the Blog model
task1 = Task(
    description="""Create a blog title and content on a given topic. Make sure the content is under 200 words.""",
    expected_output="A JSON object with 'title' and 'content' fields.",
    agent=blog_agent,
    output_json=Blog,
)

# Instantiate the crew with a sequential process
crew = Crew(
    agents=[blog_agent],
    tasks=[task1],
    verbose=True,
    process=Process.sequential,
)

# Kickoff the crew to execute the task
result = crew.kickoff()

# Option 1: Accessing Properties Using Dictionary-Style Indexing
print("Accessing Properties - Option 1")
title = result["title"]
content = result["content"]
print("Title:", title)
print("Content:", content)

# Option 2: Printing the Entire Blog Object
print("Accessing Properties - Option 2")
print("Blog:", result)
```
In this example:
* A Pydantic model `Blog` is defined with `title` and `content` fields, which is used to specify the structure of the JSON output.
* The task `task1` uses the `output_json` property to indicate that it expects a JSON output conforming to the `Blog` model.
* After executing the crew, you can access the structured JSON output in two ways, as shown.
#### Explanation of Accessing the Output
1. Accessing Properties Using Dictionary-Style Indexing: You can access the fields directly using `result["field_name"]`. This is possible because the `CrewOutput` class implements the `__getitem__` method, allowing you to treat the output like a dictionary. In this option, we're retrieving the title and content from the result.
2. Printing the Entire Blog Object: By printing `result`, you get the string representation of the `CrewOutput` object. Since the `__str__` method is implemented to return the JSON output, this will display the entire output as a formatted string representing the `Blog` object.
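Because `__str__` returns the JSON output, the result can also be parsed back into a plain dictionary (a small follow-up sketch continuing the example above):
```python Code
# str(result) is the JSON text described in point 2 above,
# so it can be round-tripped through the json module.
blog_dict = json.loads(str(result))
print(blog_dict["title"])
print(blog_dict["content"])
```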
---
By using `output_pydantic` or `output_json`, you ensure that your tasks produce outputs in a consistent and structured format, making it easier to process and use the data within your application or across multiple tasks.
## Integrating Tools with Tasks
Leverage tools from the [CrewAI Toolkit](https://github.com/joaomdmoura/crewai-tools) and [LangChain Tools](https://python.langchain.com/docs/integrations/tools) for enhanced task performance and agent interaction.
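For example, a tool can be scoped to an agent or to a single task (a minimal sketch using `SerperDevTool`, the tool referenced in the project template later in this diff; it assumes a `SERPER_API_KEY` in the environment):
```python Code
from crewai import Agent, Task
from crewai_tools import SerperDevTool  # assumes SERPER_API_KEY is set

search_tool = SerperDevTool()

researcher = Agent(
    role="Researcher",
    goal="Find up-to-date information on a topic",
    backstory="A diligent research analyst.",
    tools=[search_tool],  # available on every task this agent runs
)

research_task = Task(
    description="Research the latest news about AI agents.",
    expected_output="A short bullet-point summary.",
    agent=researcher,
    tools=[search_tool],  # task-level tools apply to this task specifically
)
```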
@@ -471,4 +632,4 @@ save_output_task = Task(
Tasks are the driving force behind the actions of agents in CrewAI.
By properly defining tasks and their outcomes, you set the stage for your AI agents to work effectively, either independently or as a collaborative unit.
Equipping tasks with appropriate tools, understanding the execution process, and following robust validation practices are crucial for maximizing CrewAI's potential,
ensuring agents are effectively prepared for their assignments and that tasks are executed as intended.
ensuring agents are effectively prepared for their assignments and that tasks are executed as intended.

View File

@@ -39,6 +39,7 @@ def create_folder_structure(name, parent_folder=None):
    folder_path.mkdir(parents=True)
    (folder_path / "tests").mkdir(exist_ok=True)
    (folder_path / "knowledge").mkdir(exist_ok=True)
    if not parent_folder:
        (folder_path / "src" / folder_name).mkdir(parents=True)
        (folder_path / "src" / folder_name / "tools").mkdir(parents=True)
@@ -52,7 +53,14 @@ def copy_template_files(folder_path, name, class_name, parent_folder):
    templates_dir = package_dir / "templates" / "crew"
    root_template_files = (
        [".gitignore", "pyproject.toml", "README.md"] if not parent_folder else []
        [
            ".gitignore",
            "pyproject.toml",
            "README.md",
            "knowledge/user_preference.txt",
        ]
        if not parent_folder
        else []
    )
    tools_template_files = ["tools/custom_tool.py", "tools/__init__.py"]
    config_template_files = ["config/agents.yaml", "config/tasks.yaml"]
@@ -168,7 +176,9 @@ def create_crew(name, provider=None, skip_provider=False, parent_folder=None):
    templates_dir = package_dir / "templates" / "crew"
    root_template_files = (
        [".gitignore", "pyproject.toml", "README.md"] if not parent_folder else []
        [".gitignore", "pyproject.toml", "README.md", "knowledge/user_preference.txt"]
        if not parent_folder
        else []
    )
    tools_template_files = ["tools/custom_tool.py", "tools/__init__.py"]
    config_template_files = ["config/agents.yaml", "config/tasks.yaml"]
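Taken together, these scaffolding changes seed every new crew with a `knowledge` folder and a starter `user_preference.txt`. A rough sketch of the expected layout for a hypothetical crew named `my_crew`:
```
my_crew/
├── knowledge/
│   └── user_preference.txt
├── src/
│   └── my_crew/
│       ├── config/
│       └── tools/
├── tests/
├── .gitignore
├── pyproject.toml
└── README.md
```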

View File

@@ -1,8 +1,9 @@
from crewai import Agent, Crew, Process, Task
from crewai.project import CrewBase, agent, crew, task, before_kickoff, after_kickoff
# Uncomment the following line to use an example of a custom tool
# from {{folder_name}}.tools.custom_tool import MyCustomTool
# Uncomment the following line to use an example of a knowledge source
# from crewai.knowledge.source.text_file_knowledge_source import TextFileKnowledgeSource
# Check our tools documentation for more information on how to use them
# from crewai_tools import SerperDevTool
@@ -57,10 +58,20 @@ class {{crew_name}}():
    @crew
    def crew(self) -> Crew:
        """Creates the {{crew_name}} crew"""
        # You can add knowledge sources here
        # knowledge_path = "user_preference.txt"
        # sources = [
        #     TextFileKnowledgeSource(
        #         file_path="knowledge/user_preference.txt",
        #         metadata={"preference": "personal"}
        #     ),
        # ]
        return Crew(
            agents=self.agents,  # Automatically created by the @agent decorator
            tasks=self.tasks,  # Automatically created by the @task decorator
            process=Process.sequential,
            verbose=True,
            # process=Process.hierarchical, # In case you want to use that instead https://docs.crewai.com/how-to/Hierarchical/
            # knowledge_sources=sources, # In case you want to add knowledge sources
        )

View File

@@ -0,0 +1,4 @@
User name is John Doe.
User is an AI Engineer.
User is interested in AI Agents.
User is based in San Francisco, California.

View File

@@ -1,36 +1,72 @@
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Union, List
from typing import Union, List, Dict, Any
from pydantic import Field
from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource
from typing import Dict, Any
from crewai.utilities.logger import Logger
from crewai.knowledge.storage.knowledge_storage import KnowledgeStorage
from crewai.utilities.constants import KNOWLEDGE_DIRECTORY
class BaseFileKnowledgeSource(BaseKnowledgeSource):
class BaseFileKnowledgeSource(BaseKnowledgeSource, ABC):
    """Base class for knowledge sources that load content from files."""
    file_path: Union[Path, List[Path]] = Field(...)
    _logger: Logger = Logger(verbose=True)
    file_path: Union[Path, List[Path], str, List[str]] = Field(
        ..., description="The path to the file"
    )
    content: Dict[Path, str] = Field(init=False, default_factory=dict)
    storage: KnowledgeStorage = Field(default_factory=KnowledgeStorage)
    safe_file_paths: List[Path] = Field(default_factory=list)
    def model_post_init(self, _):
        """Post-initialization method to load content."""
        self.safe_file_paths = self._process_file_paths()
        self.validate_paths()
        self.content = self.load_content()
    @abstractmethod
    def load_content(self) -> Dict[Path, str]:
        """Load and preprocess file content. Should be overridden by subclasses."""
        paths = [self.file_path] if isinstance(self.file_path, Path) else self.file_path
        """Load and preprocess file content. Should be overridden by subclasses. Assume that the file path is relative to the project root in the knowledge directory."""
        pass
        for path in paths:
    def validate_paths(self):
        """Validate the paths."""
        for path in self.safe_file_paths:
            if not path.exists():
                self._logger.log(
                    "error",
                    f"File not found: {path}. Try adding sources to the knowledge directory. If it's inside the knowledge directory, use the relative path.",
                    color="red",
                )
                raise FileNotFoundError(f"File not found: {path}")
            if not path.is_file():
                raise ValueError(f"Path is not a file: {path}")
        return {}
                self._logger.log(
                    "error",
                    f"Path is not a file: {path}",
                    color="red",
                )
    def save_documents(self, metadata: Dict[str, Any]):
        """Save the documents to the storage."""
        chunk_metadatas = [metadata.copy() for _ in self.chunks]
        self.storage.save(self.chunks, chunk_metadatas)
    def convert_to_path(self, path: Union[Path, str]) -> Path:
        """Convert a path to a Path object."""
        return Path(KNOWLEDGE_DIRECTORY + "/" + path) if isinstance(path, str) else path
    def _process_file_paths(self) -> List[Path]:
        """Convert file_path to a list of Path objects."""
        paths = (
            [self.file_path]
            if isinstance(self.file_path, (str, Path))
            else self.file_path
        )
        if not isinstance(paths, list):
            raise ValueError("file_path must be a Path, str, or a list of these types")
        return [self.convert_to_path(path) for path in paths]
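As a standalone illustration of the path-resolution rule added here (a minimal sketch, not the library code itself): plain strings are resolved under the knowledge directory, while `Path` objects pass through unchanged.
```python
from pathlib import Path

KNOWLEDGE_DIRECTORY = "knowledge"  # mirrors crewai.utilities.constants

def convert_to_path(path):
    """Strings resolve under the knowledge directory; Path objects are kept as-is."""
    return Path(KNOWLEDGE_DIRECTORY + "/" + path) if isinstance(path, str) else path

print(convert_to_path("example.txt"))       # knowledge/example.txt
print(convert_to_path(Path("docs/a.txt")))  # docs/a.txt (unchanged)
```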

View File

@@ -10,19 +10,15 @@ class CSVKnowledgeSource(BaseFileKnowledgeSource):
    def load_content(self) -> Dict[Path, str]:
        """Load and preprocess CSV file content."""
        super().load_content()  # Validate the file path
        file_path = (
            self.file_path[0] if isinstance(self.file_path, list) else self.file_path
        )
        file_path = Path(file_path) if isinstance(file_path, str) else file_path
        with open(file_path, "r", encoding="utf-8") as csvfile:
            reader = csv.reader(csvfile)
            content = ""
            for row in reader:
                content += " ".join(row) + "\n"
        return {file_path: content}
        content_dict = {}
        for file_path in self.safe_file_paths:
            with open(file_path, "r", encoding="utf-8") as csvfile:
                reader = csv.reader(csvfile)
                content = ""
                for row in reader:
                    content += " ".join(row) + "\n"
            content_dict[file_path] = content
        return content_dict
    def add(self) -> None:
        """

View File

@@ -8,17 +8,15 @@ class ExcelKnowledgeSource(BaseFileKnowledgeSource):
    def load_content(self) -> Dict[Path, str]:
        """Load and preprocess Excel file content."""
        super().load_content()  # Validate the file path
        pd = self._import_dependencies()
        if isinstance(self.file_path, list):
            file_path = self.file_path[0]
        else:
            file_path = self.file_path
        df = pd.read_excel(file_path)
        content = df.to_csv(index=False)
        return {file_path: content}
        content_dict = {}
        for file_path in self.safe_file_paths:
            file_path = self.convert_to_path(file_path)
            df = pd.read_excel(file_path)
            content = df.to_csv(index=False)
            content_dict[file_path] = content
        return content_dict
    def _import_dependencies(self):
        """Dynamically import dependencies."""

View File

@@ -10,11 +10,9 @@ class JSONKnowledgeSource(BaseFileKnowledgeSource):
    def load_content(self) -> Dict[Path, str]:
        """Load and preprocess JSON file content."""
        super().load_content()  # Validate the file path
        paths = [self.file_path] if isinstance(self.file_path, Path) else self.file_path
        content: Dict[Path, str] = {}
        for path in paths:
        for path in self.safe_file_paths:
            path = self.convert_to_path(path)
            with open(path, "r", encoding="utf-8") as json_file:
                data = json.load(json_file)
            content[path] = self._json_to_text(data)

View File

@@ -9,14 +9,13 @@ class PDFKnowledgeSource(BaseFileKnowledgeSource):
    def load_content(self) -> Dict[Path, str]:
        """Load and preprocess PDF file content."""
        super().load_content()  # Validate the file paths
        pdfplumber = self._import_pdfplumber()
        paths = [self.file_path] if isinstance(self.file_path, Path) else self.file_path
        content = {}
        for path in paths:
        for path in self.safe_file_paths:
            text = ""
            path = self.convert_to_path(path)
            with pdfplumber.open(path) as pdf:
                for page in pdf.pages:
                    page_text = page.extract_text()

View File

@@ -9,12 +9,11 @@ class TextFileKnowledgeSource(BaseFileKnowledgeSource):
    def load_content(self) -> Dict[Path, str]:
        """Load and preprocess text file content."""
        super().load_content()
        paths = [self.file_path] if isinstance(self.file_path, Path) else self.file_path
        content = {}
        for path in paths:
            with path.open("r", encoding="utf-8") as f:
                content[path] = f.read()  # type: ignore
        for path in self.safe_file_paths:
            path = self.convert_to_path(path)
            with open(path, "r", encoding="utf-8") as f:
                content[path] = f.read()
        return content
    def add(self) -> None:

View File

@@ -1,3 +1,4 @@
TRAINING_DATA_FILE = "training_data.pkl"
TRAINED_AGENTS_DATA_FILE = "trained_agents_data.pkl"
DEFAULT_SCORE_THRESHOLD = 0.35
KNOWLEDGE_DIRECTORY = "knowledge"