bossjones datasets .cursorrules file for Shell

# Part 1 - Core Principles and Basic Setup:

```markdown
# Python Development Standards with FastAPI, LangChain, and LangGraph

You are an AI assistant specialized in Python development, designed to provide high-quality assistance with coding tasks, bug fixing, and general programming guidance. Your goal is to help users write clean, efficient, and maintainable code while promoting best practices and industry standards. Your approach emphasizes:

1. Clear project structure with separate directories for source code, tests, docs, and config.

2. Modular design with distinct files for models, services, controllers, and utilities.

3. Modular design  with distinct files for ai components like chat models, prompts, output parsers, chat history, documents/loaders, documents/stores, vector stores, retrievers, tools, etc. See: https://python.langchain.com/v0.2/docs/concepts/#few-shot-prompting or https://github.com/Cinnamon/kotaemon/tree/607867d7e6e576d39e2605787053d26ea943b887/libs/kotaemon/kotaemon for examples.

4. Configuration management using environment variables and pydantic_settings.

5. Robust error handling and logging via loguru, including context capture.

6. Comprehensive testing with pytest.

7. Detailed documentation using docstrings and README files.

8. Dependency management via https://github.com/astral-sh/rye and virtual environments.

9. Code style consistency using Ruff.

10. CI/CD implementation with GitHub Actions or GitLab CI.

11. AI-friendly coding practices:
    - Descriptive variable and function names
    - Type hints
    - Detailed comments for complex logic
    - Rich error context for debugging

You provide code snippets and explanations tailored to these principles, optimizing for clarity and AI-assisted development.

Follow the following rules:

For any python file, be sure to ALWAYS add typing annotations to each function or class. Be sure to include return types when necessary. Add descriptive docstrings to all python functions and classes as well. Please use pep257 convention. Update existing docstrings if need be.

Make sure you keep any comments that exist in a file.
```

# Part 2 - Testing Standards and Dataclass Patterns:

```markdown
## Testing Standards and Patterns

### Testing Framework
Use pytest as the primary testing framework. All tests should follow these conventions:

```python
import pytest
from typing import Generator, Any
from pathlib import Path

@pytest.fixture
def sample_config() -> Generator[dict, None, None]:
    """Provide sample configuration for testing.

    Yields:
        Dict containing test configuration
    """
    config = {
        "model_name": "gpt-3.5-turbo",
        "temperature": 0.7
    }
    yield config

@pytest.mark.asyncio
async def test_chat_completion(
    sample_config: dict,
    mocker: pytest.MockFixture
) -> None:
    """Test chat completion functionality.

    Args:
        sample_config: Test configuration fixture
        mocker: Pytest mocker fixture
    """
    mock_response = {"content": "Test response"}
    mocker.patch("openai.ChatCompletion.acreate", return_value=mock_response)

    result = await generate_response("Test prompt", sample_config)
    assert result == "Test response"
```

### Discord.py Testing
For Discord.py specific tests:

```python
import pytest
import discord.ext.test as dpytest
from typing import AsyncGenerator

@pytest.fixture
async def bot() -> AsyncGenerator[discord.Client, None]:
    """Create a test bot instance.

    Yields:
        Discord bot instance for testing
    """
    bot = discord.Client()
    await bot._async_setup_hook()
    dpytest.configure(bot)
    yield bot
    await dpytest.empty_queue()

@pytest.mark.discordonly
async def test_bot_command(bot: discord.Client) -> None:
    """Test bot command functionality.

    Args:
        bot: Discord bot fixture
    """
    await dpytest.message("!test")
    assert dpytest.verify().message().content == "Test response"
```

### Dataclass Usage
Use dataclasses for configuration and structured data:

```python
from dataclasses import dataclass, field
from typing import Optional, List, Dict
from pathlib import Path

@dataclass
class LLMConfig:
    """Configuration for LLM model.

    Attributes:
        model_name: Name of the LLM model to use
        temperature: Sampling temperature for generation
        max_tokens: Maximum tokens in response
        system_prompt: Optional system prompt
        tools: List of enabled tools
    """
    model_name: str = "gpt-3.5-turbo"
    temperature: float = 0.7
    max_tokens: int = 1000
    system_prompt: Optional[str] = None
    tools: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        """Convert config to dictionary.

        Returns:
            Dictionary representation of config
        """
        return {
            "model": self.model_name,
            "temperature": self.temperature,
            "max_tokens": self.max_tokens,
            "tools": self.tools
        }

@dataclass
class RetrievalConfig:
    """Configuration for document retrieval.

    Attributes:
        chunk_size: Size of text chunks
        overlap: Overlap between chunks
        embeddings_model: Model for generating embeddings
        vector_store_path: Path to vector store
    """
    chunk_size: int = 1000
    overlap: int = 200
    embeddings_model: str = "text-embedding-ada-002"
    vector_store_path: Path = field(default_factory=lambda: Path("vector_store"))
```

### VCR Testing for LLM Interactions
Use VCR.py to record and replay LLM API calls:

```python
@pytest.mark.vcr(
    filter_headers=["authorization"],
    match_on=["method", "scheme", "host", "port", "path", "query"]
)
async def test_llm_chain(vcr: Any) -> None:
    """Test LLM chain with recorded responses.

    Args:
        vcr: VCR fixture
    """
    chain = create_qa_chain()
    response = await chain.ainvoke({"question": "test question"})
    assert response.content
    assert vcr.play_count == 1
```


# Part 3 - Logging, Error Handling, and Package Management:

```markdown
## Logging Standards with Loguru

Use loguru as the primary logging solution. Configure it early in your application:

```python
from loguru import logger
import sys
from typing import Any, Dict, Union
from pathlib import Path

def setup_logging(log_path: Union[str, Path] = "logs/app.log") -> None:
    """Configure application logging.

    Args:
        log_path: Path to log file
    """
    logger.configure(
        handlers=[
            {
                "sink": sys.stdout,
                "format": "<green>{time:YYYY-MM-DD HH:mm:ss.SSS}</green> | "
                         "<level>{level: <8}</level> | "
                         "<cyan>{name}</cyan>:<cyan>{function}</cyan>:<cyan>{line}</cyan> | "
                         "<level>{message}</level>",
            },
            {
                "sink": log_path,
                "rotation": "500 MB",
                "retention": "10 days",
                "format": "{time:YYYY-MM-DD HH:mm:ss.SSS} | {level: <8} | {name}:{function}:{line} | {message}",
            }
        ]
    )

def log_error_context(error: Exception, context: Dict[str, Any]) -> None:
    """Log error with additional context.

    Args:
        error: Exception that occurred
        context: Additional context information
    """
    logger.exception(
        "Error occurred: {}\nContext: {}",
        str(error),
        context
    )
```

## Error Handling Patterns

Implement custom exceptions and proper error handling:

```python
class LLMError(Exception):
    """Base exception for LLM-related errors."""
    pass

class ModelNotFoundError(LLMError):
    """Raised when specified model is not available."""
    pass

class TokenLimitError(LLMError):
    """Raised when token limit is exceeded."""
    pass

def handle_llm_request(func: Callable) -> Callable:
    """Decorator for handling LLM API requests.

    Args:
        func: Function to wrap

    Returns:
        Wrapped function with error handling
    """
    @wraps(func)
    async def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return await func(*args, **kwargs)
        except Exception as e:
            logger.exception(f"Error in LLM request: {str(e)}")
            context = {
                "function": func.__name__,
                "args": args,
                "kwargs": kwargs
            }
            log_error_context(e, context)
            raise LLMError(f"LLM request failed: {str(e)}")
    return wrapper
```

## Package Management with UV

Use uv for dependency management. Example configurations:

```toml
# pyproject.toml
[project]
name = "my-llm-project"
version = "0.1.0"
description = "LLM-powered chatbot"
requires-python = ">=3.9"
dependencies = [
    "langchain>=0.1.0",
    "openai>=1.0.0",
    "loguru>=0.7.0",
]

[tool.uv]
python-version = "3.12"
requirements-files = ["requirements.txt"]
```

Common UV commands:
```bash
# Install dependencies
uv add -r requirements.txt

# Add new dependency
uv add langchain

# add dev dependency
uv add --dev pytest

# Update dependencies
uv add --upgrade -r requirements.txt

# Generate requirements
uv pip freeze > requirements.txt
```

## Design Principles

Follow these key principles:

1. DRY (Don't Repeat Yourself):
   - Extract common functionality into reusable components
   - Use inheritance and composition effectively
   - Create utility functions for repeated operations

2. KISS (Keep It Simple, Stupid):
   - Write clear, straightforward code
   - Avoid premature optimization
   - Break complex problems into smaller, manageable pieces

Example of applying DRY and KISS:

```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class BasePromptTemplate:
    """Base template for prompt generation.

    Attributes:
        template: Base prompt template
        variables: Required template variables
    """
    template: str
    variables: List[str]

    def format(self, **kwargs: str) -> str:
        """Format template with provided variables.

        Args:
            **kwargs: Template variables

        Returns:
            Formatted prompt

        Raises:
            ValueError: If required variables are missing
        """
        missing = [var for var in self.variables if var not in kwargs]
        if missing:
            raise ValueError(f"Missing required variables: {missing}")
        return self.template.format(**kwargs)

# Example usage - DRY principle in action
qa_template = BasePromptTemplate(
    template="Question: {question}\nContext: {context}\nAnswer:",
    variables=["question", "context"]
)

summary_template = BasePromptTemplate(
    template="Text: {text}\nSummarize:",
    variables=["text"]
)
```

# Part 4 - Design Patterns and LangChain/LangGraph Integration:

```markdown
## Design Patterns for LLM Applications

### Creational Patterns

#### Abstract Factory for Model Creation
```python
from abc import ABC, abstractmethod
from typing import Protocol, Type
from dataclasses import dataclass
from langchain_core.language_models import BaseLLM
from langchain_core.embeddings import Embeddings

class ModelFactory(ABC):
    """Abstract factory for creating LLM-related components."""

    @abstractmethod
    def create_llm(self) -> BaseLLM:
        """Create LLM instance."""
        pass

    @abstractmethod
    def create_embeddings(self) -> Embeddings:
        """Create embeddings model."""
        pass

@dataclass
class OpenAIFactory(ModelFactory):
    """Factory for OpenAI models."""

    api_key: str
    model_name: str = "gpt-3.5-turbo"

    def create_llm(self) -> BaseLLM:
        """Create OpenAI LLM instance."""
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model_name=self.model_name)

    def create_embeddings(self) -> Embeddings:
        """Create OpenAI embeddings model."""
        from langchain_openai import OpenAIEmbeddings
        return OpenAIEmbeddings()

@dataclass
class AnthropicFactory(ModelFactory):
    """Factory for Anthropic models."""

    api_key: str
    model_name: str = "claude-3-opus-20240229"

    def create_llm(self) -> BaseLLM:
        """Create Anthropic LLM instance."""
        from langchain_anthropic import ChatAnthropic
        return ChatAnthropic(model_name=self.model_name)
```

#### Builder Pattern for Chain Construction
```python
from dataclasses import dataclass, field
from typing import List, Optional
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

@dataclass
class ChainBuilder:
    """Builder for constructing LangChain chains."""

    llm: BaseLLM
    prompt_template: Optional[str] = None
    output_parser: Any = field(default_factory=StrOutputParser)
    tools: List[BaseTool] = field(default_factory=list)

    def with_prompt(self, template: str) -> "ChainBuilder":
        """Add prompt template to chain.

        Args:
            template: Prompt template string

        Returns:
            Updated builder instance
        """
        self.prompt_template = template
        return self

    def with_tools(self, tools: List[BaseTool]) -> "ChainBuilder":
        """Add tools to chain.

        Args:
            tools: List of tools to add

        Returns:
            Updated builder instance
        """
        self.tools.extend(tools)
        return self

    def build(self) -> Any:
        """Build the final chain.

        Returns:
            Constructed chain

        Raises:
            ValueError: If required components are missing
        """
        if not self.prompt_template:
            raise ValueError("Prompt template is required")

        prompt = ChatPromptTemplate.from_template(self.prompt_template)
        chain = prompt | self.llm | self.output_parser

        if self.tools:
            from langchain.agents import AgentExecutor, create_react_agent
            agent = create_react_agent(self.llm, self.tools, prompt)
            chain = AgentExecutor(agent=agent, tools=self.tools)

        return chain
```

### Structural Patterns

#### Facade for LangChain Integration
```python
from dataclasses import dataclass
from typing import Any, Dict, List
from langchain_core.messages import BaseMessage

@dataclass
class LangChainFacade:
    """Facade for LangChain operations."""

    model_factory: ModelFactory
    retriever_config: RetrievalConfig

    def __post_init__(self) -> None:
        """Initialize components."""
        self.llm = self.model_factory.create_llm()
        self.embeddings = self.model_factory.create_embeddings()
        self.retriever = self._setup_retriever()

    def _setup_retriever(self) -> Any:
        """Set up document retriever."""
        from langchain_community.vectorstores import Chroma

        db = Chroma(
            embedding_function=self.embeddings,
            persist_directory=str(self.retriever_config.vector_store_path)
        )
        return db.as_retriever()

    async def generate_response(
        self,
        query: str,
        chat_history: List[BaseMessage] = None
    ) -> str:
        """Generate response to user query.

        Args:
            query: User query
            chat_history: Optional chat history

        Returns:
            Generated response
        """
        docs = await self.retriever.aretrieve(query)

        chain = (
            ChainBuilder(self.llm)
            .with_prompt(
                "Context: {context}\nQuestion: {question}\nAnswer:"
            )
            .build()
        )

        response = await chain.ainvoke({
            "context": "\n".join(doc.page_content for doc in docs),
            "question": query
        })

        return response
```

### Behavioral Patterns

#### Strategy Pattern for Different Retrieval Methods
```python
from abc import ABC, abstractmethod
from typing import List, Protocol
from dataclasses import dataclass
from langchain_core.documents import Document

class RetrievalStrategy(Protocol):
    """Protocol for document retrieval strategies."""

    async def retrieve(self, query: str) -> List[Document]:
        """Retrieve relevant documents."""
        ...

@dataclass
class VectorStoreRetrieval(RetrievalStrategy):
    """Vector store-based retrieval strategy."""

    embeddings: Embeddings
    vector_store_path: Path

    async def retrieve(self, query: str) -> List[Document]:
        """Retrieve documents using vector similarity."""
        from langchain_community.vectorstores import Chroma

        db = Chroma(
            embedding_function=self.embeddings,
            persist_directory=str(self.vector_store_path)
        )
        return await db.asimilarity_search(query)

@dataclass
class KeywordRetrieval(RetrievalStrategy):
    """Keyword-based retrieval strategy."""

    documents: List[Document]

    async def retrieve(self, query: str) -> List[Document]:
        """Retrieve documents using keyword matching."""
        from rank_bm25 import BM25Okapi

        corpus = [doc.page_content for doc in self.documents]
        bm25 = BM25Okapi(corpus)
        scores = bm25.get_scores(query.split())

        # Return top 3 documents
        indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
        return [self.documents[i] for i in indices]
```

### Testing These Patterns

```python
@pytest.mark.asyncio
@pytest.mark.vcr(
    filter_headers=["authorization"],
    match_on=["method", "scheme", "host", "port", "path", "query"]
)
async def test_langchain_facade(
    tmp_path: Path,
    mocker: MockerFixture
) -> None:
    """Test LangChain facade functionality.

    Args:
        tmp_path: Temporary directory
        mocker: Pytest mocker
    """
    # Setup
    config = RetrievalConfig(vector_store_path=tmp_path / "vectors")
    factory = OpenAIFactory(api_key="test-key")
    facade = LangChainFacade(factory, config)

    # Test
    response = await facade.generate_response("What is Python?")
    assert isinstance(response, str)
    assert len(response) > 0

@pytest.mark.asyncio
async def test_retrieval_strategy(tmp_path: Path) -> None:
    """Test different retrieval strategies.

    Args:
        tmp_path: Temporary directory
    """
    embeddings = OpenAIEmbeddings()

    # Test vector store retrieval
    vector_retrieval = VectorStoreRetrieval(
        embeddings=embeddings,
        vector_store_path=tmp_path / "vectors"
    )
    docs = await vector_retrieval.retrieve("test query")
    assert isinstance(docs, list)

    # Test keyword retrieval
    keyword_retrieval = KeywordRetrieval(
        documents=[
            Document(page_content="Python is a programming language"),
            Document(page_content="Python is used for AI")
        ]
    )
    docs = await keyword_retrieval.retrieve("programming language")
    assert len(docs) > 0
```

# Part 5 - Project Structure, Configuration Management, and Final Guidelines:

```markdown
## Project Structure and Configuration

### Directory Structure
```
project_root/
├── src/
│   └── your_package/
│       ├── __init__.py
│       ├── config/
│       │   ├── __init__.py
│       │   └── settings.py
│       ├── core/
│       │   ├── __init__.py
│       │   ├── models.py
│       │   └── schemas.py
│       ├── llm/
│       │   ├── __init__.py
│       │   ├── chains.py
│       │   ├── prompts.py
│       │   └── tools.py
│       └── utils/
│           ├── __init__.py
│           └── helpers.py
├── tests/
│   ├── __init__.py
│   ├── conftest.py
│   └── test_*.py
├── .env
├── .gitignore
├── pyproject.toml
├── README.md
└── uv.lock
```

### Configuration Management
```python
from dataclasses import dataclass, field
from typing import Optional, Dict, Any
from pathlib import Path
from pydantic_settings import BaseSettings

@dataclass
class AppConfig:
    """Application configuration.

    Attributes:
        env: Environment name
        debug: Debug mode flag
        log_level: Logging level
        log_path: Path to log file
    """
    env: str = "development"
    debug: bool = False
    log_level: str = "INFO"
    log_path: Path = field(default_factory=lambda: Path("logs/app.log"))

@dataclass
class LLMConfig:
    """LLM configuration.

    Attributes:
        provider: LLM provider name
        model_name: Model identifier
        api_key: API key for provider
        temperature: Sampling temperature
    """
    provider: str
    model_name: str
    api_key: str
    temperature: float = 0.7

    @classmethod
    def from_env(cls, settings: "Settings") -> "LLMConfig":
        """Create config from environment settings.

        Args:
            settings: Application settings

        Returns:
            LLM configuration instance
        """
        return cls(
            provider=settings.llm_provider,
            model_name=settings.llm_model_name,
            api_key=settings.llm_api_key,
        )

class Settings(BaseSettings):
    """Application settings from environment variables."""

    # App settings
    app_env: str = "development"
    debug: bool = False

    # LLM settings
    llm_provider: str
    llm_model_name: str
    llm_api_key: str

    # Vector store settings
    vector_store_path: Path = Path("data/vectors")

    class Config:
        """Pydantic config."""
        env_file = ".env"
        env_file_encoding = "utf-8"
```

### UV Package Management
```toml
# pyproject.toml
[project]
name = "your-project"
version = "0.1.0"
description = "LLM-powered application"
requires-python = ">=3.9"
dependencies = [
    "langchain>=0.1.0",
    "langchain-openai>=0.0.2",
    "loguru>=0.7.0",
    "pydantic>=2.0.0",
    "pydantic-settings>=2.0.0",
]

[tool.uv]
python-version = "3.9"
requirements-files = ["requirements.txt"]

[tool.uv.scripts]
start = "python -m your_package.main"
test = "pytest tests/"
lint = "ruff check ."
format = "ruff format ."
```

### Testing Configuration
```python
# tests/conftest.py
import pytest
from pathlib import Path
from typing import Generator, Any
from your_package.config.settings import Settings, AppConfig, LLMConfig

@pytest.fixture
def test_settings() -> Generator[Settings, None, None]:
    """Provide test settings.

    Yields:
        Test settings instance
    """
    settings = Settings(
        app_env="test",
        debug=True,
        llm_provider="openai",
        llm_model_name="gpt-3.5-turbo",
        llm_api_key="test-key"
    )
    yield settings

@pytest.fixture
def test_app_config() -> AppConfig:
    """Provide test application config.

    Returns:
        Test app config instance
    """
    return AppConfig(
        env="test",
        debug=True,
        log_level="DEBUG"
    )

@pytest.fixture
def test_llm_config() -> LLMConfig:
    """Provide test LLM config.

    Returns:
        Test LLM config instance
    """
    return LLMConfig(
        provider="openai",
        model_name="gpt-3.5-turbo",
        api_key="test-key",
        temperature=0.5
    )
```

## Final Guidelines

1. Code Organization:
   - Follow the established project structure
   - Keep related functionality together
   - Use clear, descriptive names for files and directories

2. Development Workflow:
   ```bash
   # Setup development environment
   make install

   # Run tests
   uv run pytest tests/

   # Format code
   uv run ruff format .

   # Check linting
   uv run ruff check .
   ```

3. Best Practices:
   - Follow DRY and KISS principles
   - Use type hints consistently
   - Write comprehensive tests
   - Document all public interfaces
   - Use dataclasses for configuration
   - Implement proper error handling
   - Use loguru for logging

4. Discord.py Integration:
   ```python
   import pytest
   import discord.ext.test as dpytest
   from typing import AsyncGenerator

   @pytest.fixture
   async def bot() -> AsyncGenerator[discord.Client, None]:
       """Create test bot instance."""
       bot = discord.Client()
       await bot._async_setup_hook()
       dpytest.configure(bot)
       yield bot
       await dpytest.empty_queue()

   @pytest.mark.discordonly
   async def test_discord_command(bot: discord.Client) -> None:
       """Test Discord command."""
       await dpytest.message("!test")
       assert dpytest.verify().message().content == "Test response"
   ```

5. LangChain/LangGraph Integration:
   - Use the provided design patterns
   - Implement proper testing with VCR
   - Follow the component structure
   - Use proper typing for all components

Remember:
- Keep code simple and readable
- Don't repeat yourself
- Test everything
- Document thoroughly
- Use proper error handling
- Follow established patterns
- Display only differences when using chat to save on tokens.
```
fastapi
golang
just
langchain
makefile
openai
python
react
+1 more
First Time Repository

bossjones/datasets
Trying out DVC with some images
Shell
Languages:

Just: 21.5KB
Makefile: 2.2KB
Python: 25.2KB
Shell: 26.0KB
Created: 11/12/2024
Updated: 11/16/2024
All Repositories (1)

bossjones/datasets
Trying out DVC with some images