---
description: This rule set enforces best practices for developing with the transformers library, covering code organization, performance, security, and testing to promote maintainable and efficient NLP applications.
globs: **/*.py
---
- **Environment Management:**
- Use Conda or uv to create isolated environments for consistent dependencies across projects.
- Example (Conda): `conda create -n myenv python=3.12`
- Example (uv): `uv venv`
- Use `environment.yml` or `requirements.txt` to manage dependencies.
- Always use Python 3.12.
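A minimal `environment.yml` along these lines captures the points above; the listed packages are illustrative, not prescribed by this rule set:

```yaml
name: transformers-env
channels:
  - conda-forge
dependencies:
  - python=3.12
  - pip
  - pip:
      - transformers
      - torch
      - mlflow
```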
- **Code Organization and Structure:**
- **Directory Structure:** Adopt a modular structure for maintainability.
- `src/`: Source code.
- `models/`: Transformer model definitions.
- `data/`: Data loading and preprocessing.
- `training/`: Training scripts and utilities.
- `utils/`: Utility functions.
- `tests/`: Unit and integration tests.
- `notebooks/`: Experimentation and exploration notebooks.
- `scripts/`: Deployment and utility scripts.
- `config/`: Configuration files.
- **File Naming Conventions:**
- Use descriptive names: `model.py`, `data_loader.py`, `trainer.py`.
- Follow PEP 8 guidelines.
- **Module Organization:**
- Group related functions and classes into modules.
- Use clear and concise module names.
- **Component Architecture:**
- Implement modular components with well-defined interfaces.
- Use classes for stateful components, such as data loaders or model wrappers.
- **Code Splitting:**
- Split large files into smaller, manageable modules.
- Use lazy loading for large models or datasets to improve startup time.
- **Model Training and Evaluation:**
- Implement a structured approach for training and evaluating models.
- Use Hugging Face Transformers for easy access to pre-trained models.
- Log experiments using MLflow or TensorBoard for tracking model performance and versioning.
- Create clear separation between training, validation, and test datasets to prevent data leakage.
- Use consistent evaluation metrics.
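A leakage-free split can be as simple as shuffling once with a fixed seed and slicing into disjoint partitions. This is a minimal stdlib sketch (function name and fractions are illustrative, not from this rule set):

```python
import random

def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Deterministically split examples into disjoint train/val/test sets.

    Shuffling with a fixed seed before slicing keeps the split reproducible
    and guarantees no example appears in more than one partition.
    """
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = examples[:n_test]
    val = examples[n_test:n_test + n_val]
    train = examples[n_test + n_val:]
    return train, val, test
```

Because the seed is fixed, every run of the pipeline sees the same partitions, which keeps logged metrics comparable across experiments.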
- **Data Handling:**
- Preprocess data effectively using libraries like Hugging Face's tokenizers.
- Ensure proper tokenization and encoding.
- Manage large datasets efficiently with data loaders.
- Implement data validation to ensure data quality.
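Data validation before tokenization can catch bad rows early instead of failing mid-training. A minimal sketch (helper names and the character limit are hypothetical):

```python
def validate_example(text, max_chars=10000):
    """Basic pre-tokenization checks: reject non-string, empty, or oversized inputs."""
    if not isinstance(text, str):
        raise TypeError(f"expected str, got {type(text).__name__}")
    text = text.strip()
    if not text:
        raise ValueError("empty example")
    if len(text) > max_chars:
        raise ValueError(f"example exceeds {max_chars} characters")
    return text

def validate_batch(texts):
    """Validate a batch, collecting indices of bad rows instead of failing fast."""
    clean, bad = [], []
    for i, t in enumerate(texts):
        try:
            clean.append(validate_example(t))
        except (TypeError, ValueError):
            bad.append(i)
    return clean, bad
```

Collecting indices of rejected rows, rather than raising on the first one, makes it easy to report data-quality problems in bulk.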
- **Code Structure:**
- Organize code into reusable modules and functions.
- Follow a consistent naming convention and documentation style (PEP 8) for enhanced readability and collaboration.
- Use classes where appropriate to encapsulate state and behavior; keep stateless logic in plain functions.
- Add docstrings to all functions, classes, and modules.
- **Common Patterns and Anti-patterns:**
- **Design Patterns:**
- Use the Factory pattern for creating different model architectures.
- Use the Strategy pattern for different training strategies.
- Use the Decorator pattern for adding functionality to models.
- **Recommended Approaches:**
- Use pipelines for common tasks like text classification or question answering.
- Leverage pre-trained models for transfer learning.
- **Anti-patterns:**
- Avoid hardcoding configurations.
- Avoid global variables.
- Avoid deeply nested code.
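The Factory pattern mentioned above can be sketched as a registry keyed by task name; the wrapper classes here are hypothetical stand-ins for real model definitions:

```python
class TextClassifier:
    """Placeholder for a sequence-classification model wrapper."""
    def __init__(self, num_labels):
        self.num_labels = num_labels

class TokenTagger:
    """Placeholder for a token-classification model wrapper."""
    def __init__(self, num_labels):
        self.num_labels = num_labels

_MODEL_REGISTRY = {
    "classification": TextClassifier,
    "tagging": TokenTagger,
}

def create_model(task, **kwargs):
    """Factory: look the task up in a registry instead of branching with if/elif."""
    try:
        cls = _MODEL_REGISTRY[task]
    except KeyError:
        raise ValueError(f"unknown task {task!r}; choose from {sorted(_MODEL_REGISTRY)}") from None
    return cls(**kwargs)
```

New architectures are added by registering a class, with no changes to calling code, which also avoids the hardcoded-configuration anti-pattern noted above.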
- **State Management:**
- Encapsulate state within classes.
- Use configuration files to manage hyperparameters.
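Hyperparameters in a configuration file can be merged with typed defaults via a dataclass; a minimal JSON-based sketch (field names and defaults are illustrative):

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Immutable hyperparameter container with sensible defaults."""
    learning_rate: float = 5e-5
    batch_size: int = 32
    epochs: int = 3

def load_config(path):
    """Read hyperparameter overrides from a JSON file; unspecified fields keep defaults."""
    with open(path) as f:
        overrides = json.load(f)
    return TrainConfig(**overrides)
```

A frozen dataclass also guards against code silently mutating hyperparameters mid-run.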
- **Error Handling:**
- Implement try-except blocks for error handling.
- Log errors and warnings using the `logging` module.
- Raise informative exceptions.
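The three points above combine naturally: catch, log with traceback, and re-raise a domain-specific exception. A sketch (the exception and helper names are hypothetical):

```python
import logging

logger = logging.getLogger(__name__)

class InferenceError(RuntimeError):
    """Raised when a prediction cannot be produced for an input."""

def predict_safely(model_fn, text):
    """Run a prediction, logging failures and re-raising an informative exception."""
    try:
        return model_fn(text)
    except Exception as exc:
        # logger.exception records the full traceback at ERROR level
        logger.exception("prediction failed for input of length %d", len(text))
        raise InferenceError(f"could not classify input: {exc}") from exc
```

Chaining with `raise ... from exc` preserves the original traceback for debugging while callers only need to handle `InferenceError`.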
- **Performance Considerations:**
- Use appropriate batch sizes to optimize GPU utilization.
- Utilize mixed-precision training (FP16) for faster training and reduced memory consumption.
- Cache intermediate results to avoid redundant computations.
- Profile code using tools like `cProfile` to identify bottlenecks.
- Use `torch.compile` (available in PyTorch 2.x) to speed up models when applicable.
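Profiling with `cProfile` can be wrapped in a small helper that returns both the result and a hotspot report, all from the standard library:

```python
import cProfile
import io
import pstats

def profile(fn, *args, **kwargs):
    """Profile a single call; return its result plus a report of the top hotspots."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    buf = io.StringIO()
    # Sort by cumulative time and keep the 10 most expensive entries
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()
```

Wrapping one call at a time keeps the profile focused on a suspected bottleneck (e.g. a data-loading or forward-pass function) rather than the whole program.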
- **Memory Management:**
- Release cached GPU memory with `torch.cuda.empty_cache()` after deleting tensors that are no longer referenced (it does not free tensors that are still live).
- Use data loaders with `num_workers` to parallelize data loading.
- **Dependency and Artifact Size:**
- Remove unused dependencies.
- Keep deployment artifacts small with slim base images and multi-stage Docker builds; JavaScript-style minification does not apply to Python packages.
- **Lazy Loading:**
- Load models and datasets only when needed.
- Use `torch.jit.script` to compile models for inference.
- **Security Best Practices:**
- **Common Vulnerabilities:**
- Input injection attacks.
- Model poisoning attacks.
- **Input Validation:**
- Validate input data to prevent injection attacks.
- Sanitize user input before feeding it to the model.
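A sanitization pass might normalize Unicode, strip control characters, and cap input length so adversarial inputs cannot blow up tokenization or memory. A stdlib sketch (the limit and policy are illustrative):

```python
import unicodedata

MAX_INPUT_CHARS = 4096  # illustrative limit

def sanitize_input(text):
    """Normalize and bound user input before it reaches the model."""
    if not isinstance(text, str):
        raise TypeError("input must be a string")
    # NFKC folds visually equivalent code points to a canonical form
    text = unicodedata.normalize("NFKC", text)
    # Drop control characters (Unicode category "C*") except newline and tab
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )
    text = text.strip()
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input longer than {MAX_INPUT_CHARS} characters")
    return text
```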
- **Authentication and Authorization:**
- Implement authentication and authorization for API endpoints.
- Use secure protocols like HTTPS for communication.
- **Data Protection:**
- Encrypt sensitive data at rest and in transit.
- Use appropriate access controls to protect data.
- **Secure API Communication:**
- Use API keys or tokens for authentication.
- Implement rate limiting to prevent abuse.
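Rate limiting is commonly implemented as a token bucket: requests spend tokens, tokens refill at a fixed rate, and bursts are bounded by the bucket capacity. A minimal in-process sketch (production deployments usually use a shared store such as Redis instead):

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock  # injectable for testing
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, never beyond capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Injecting the clock makes the limiter deterministic under test, and `time.monotonic` avoids surprises when the wall clock is adjusted.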
- **Testing Approaches:**
- **Unit Testing:**
- Test individual components in isolation.
- Use mocking to isolate dependencies.
- Cover all code paths with unit tests.
- **Integration Testing:**
- Test interactions between different components.
- Verify that data flows correctly through the system.
- **End-to-End Testing:**
- Test the entire application from end to end.
- Simulate user interactions to verify functionality.
- **Test Organization:**
- Organize tests into separate directories.
- Use descriptive test names.
- **Mocking and Stubbing:**
- Use mocking frameworks like `unittest.mock` to isolate dependencies.
- Stub out external API calls to prevent network access.
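Stubbing an external call with `unittest.mock` keeps tests fast and offline; the function under test and its endpoint here are hypothetical examples:

```python
import unittest
from unittest import mock

def classify_remote(text, http_get):
    """Hypothetical helper that delegates classification to a remote service.

    The HTTP client is injected, which is what makes it easy to stub in tests.
    """
    response = http_get("https://example.com/classify", params={"q": text})
    return response["label"]

class ClassifyRemoteTest(unittest.TestCase):
    def test_uses_stubbed_response(self):
        # Mock stands in for the HTTP client: no network access occurs
        fake_get = mock.Mock(return_value={"label": "positive"})
        self.assertEqual(classify_remote("great movie", fake_get), "positive")
        fake_get.assert_called_once()
```

Passing the client as a parameter (rather than importing it inside the function) is the dependency-injection step that makes the stub trivial.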
- **Common Pitfalls and Gotchas:**
- **Frequent Mistakes:**
- Incorrectly configuring tokenizers.
- Using outdated versions of the library.
- Not handling edge cases in data preprocessing.
- **Edge Cases:**
- Handling long sequences.
- Dealing with out-of-vocabulary words.
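Long sequences are often handled by splitting token IDs into overlapping windows so each chunk fits the model's context length while boundary context is preserved. A sketch (the 512/128 defaults mirror common BERT-style settings but are illustrative):

```python
def chunk_token_ids(token_ids, max_len=512, stride=128):
    """Split a long token sequence into overlapping windows a model can accept.

    Consecutive chunks share `stride` tokens so no context is lost at
    chunk boundaries.
    """
    if max_len <= stride:
        raise ValueError("max_len must exceed stride")
    chunks = []
    step = max_len - stride
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
    return chunks
```

Hugging Face tokenizers expose similar behavior natively (e.g. truncation with a stride for overflowing tokens), so prefer the built-in mechanism when using the library's tokenizers.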
- **Version-Specific Issues:**
- Check the release notes for breaking changes.
- Test code with different versions of the library.
- **Compatibility Concerns:**
- Ensure compatibility with different hardware and software configurations.
- Check compatibility with other libraries.
- **Debugging Strategies:**
- Use debuggers like `pdb` to step through code.
- Use logging to track program execution.
- **Tooling and Environment:**
- **Recommended Tools:**
- VS Code with Python extension.
- PyCharm.
- Jupyter Notebook.
- Debuggers: pdb, ipdb.
- **Build Configuration:**
- Use `pyproject.toml` (preferred for new projects) or legacy `setup.py` for build configuration.
- Specify dependencies in `requirements.txt` or `environment.yml`.
- **Linting and Formatting:**
- Use linters like `flake8` and `pylint` to enforce code style.
- Use formatters like `black` and `autopep8` to automatically format code.
- **Deployment:**
- Use Docker to containerize the application.
- Deploy to cloud platforms like AWS, Azure, or GCP.
- **CI/CD Integration:**
- Use CI/CD pipelines to automate testing and deployment.
- Integrate with version control systems like Git.
- **Additional Recommendations:**
- Always document your code thoroughly.
- Write clear and concise commit messages.
- Use version control (Git) to track changes.
- Participate in the transformers community to learn from others.
- Regularly update the library to benefit from bug fixes and new features.
- **References:**
- [Hugging Face Transformers Documentation](https://huggingface.co/transformers/)
- [PyTorch Documentation](https://pytorch.org/docs/stable/index.html)
- [MLflow Documentation](https://www.mlflow.org/docs/latest/index.html)
- [BERT Fine-Tuning Tutorial with PyTorch](http://www.mccormickml.com)