|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
tags: |
|
- music |
|
- text-generation |
|
- transformers |
|
pipeline_tag: text-generation |
|
library_name: transformers |
|
--- |
|
|
|
# ScrapeGoatMusic Generation API (Stage 2 Model)
|
|
|
A music generation system powered by ScrapeGoatMusic, served through a FastAPI interface and optimized for NVIDIA H100 GPUs.
|
|
|
## System Requirements |
|
|
|
- NVIDIA H100 GPU |
|
- CUDA 12.0 or higher |
|
- Python 3.8 |
|
- 32GB+ RAM |
|
- Ubuntu 22.04 LTS or higher |
|
|
|
## Installation |
|
|
|
1. Create and activate a conda environment: |
|
```bash |
|
conda create -n ScrapeGoatMusic python=3.8 |
|
conda activate ScrapeGoatMusic |
|
``` |
|
|
|
2. Install PyTorch with CUDA support: |
|
```bash |
|
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia |
|
``` |
|
|
|
3. Install dependencies: |
|
```bash |
|
pip install descript-audio-codec |
|
pip install npy_append_array soundfile |
|
pip install fastapi uvicorn python-multipart |
|
pip install flash-attn --no-build-isolation |
|
``` |
|
|
|
4. Download the required model files (the RepCodec step below expects the `xcodec_mini_infer` directory this creates):

```bash
# Download models from Hugging Face (run from the repository root)
git lfs install
cd inference
git clone https://huggingface.co/Nathan9/xcodec_mini_infer
```
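If `git lfs` is slow or unavailable, the same files can usually be fetched with the `huggingface_hub` Python client instead; the snippet below is a minimal sketch and assumes the `Nathan9/xcodec_mini_infer` repository listed above (`huggingface_hub` is an extra dependency not installed by the steps above).

```python
# Alternative download via huggingface_hub (assumes the repo id shown above)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Nathan9/xcodec_mini_infer",
    local_dir="inference/xcodec_mini_infer",  # same location the git clone would create
)
```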
|
|
|
5. Clone and install RepCodec:

```bash
# Run from the repository root
cd inference/xcodec_mini_infer
git clone https://github.com/mct10/RepCodec.git
cd RepCodec
pip install .
```
|
|
|
## API Setup |
|
|
|
1. Create a new file `api.py`: |
|
```python |
|
from fastapi import FastAPI, UploadFile, File, Form |
|
from fastapi.responses import FileResponse |
|
import uvicorn |
|
import torch |
|
import os |
|
import argparse |
|
from pathlib import Path |
|
import uuid |
|
from typing import Optional |
|
|
|
app = FastAPI(title="ScrapeGoatMusic Generation API") |
|
|
|
# Initialize models and configurations |
|
def init_models(): |
|
parser = argparse.ArgumentParser() |
|
# Add all your existing arguments here |
|
args = parser.parse_args([]) |
|
args.stage1_model = "Nathan9/ScrapeGoatMusic-s1-7B-anneal-en-cot" |
|
args.stage2_model = "Nathan9/ScrapeGoatMusic-s2-1B-general" |
|
args.max_new_tokens = 3000 |
|
args.run_n_segments = 2 |
|
args.stage2_batch_size = 4 |
|
args.output_dir = "./output" |
|
args.cuda_idx = 0 |
|
# Add other default arguments |
|
return args |
|
|
|
@app.on_event("startup") |
|
async def startup_event(): |
|
global args |
|
args = init_models() |
|
os.makedirs(args.output_dir, exist_ok=True) |
|
|
|
@app.post("/generate") |
|
async def generate_music( |
|
genre_file: UploadFile = File(...), |
|
lyrics_file: UploadFile = File(...), |
|
audio_prompt: Optional[UploadFile] = File(None), |
|
prompt_start_time: float = Form(0.0), |
|
prompt_end_time: float = Form(30.0) |
|
): |
|
# Create unique session ID |
|
session_id = str(uuid.uuid4()) |
|
session_dir = Path(args.output_dir) / session_id |
|
os.makedirs(session_dir, exist_ok=True) |
|
|
|
# Save uploaded files |
|
genre_path = session_dir / "genre.txt" |
|
lyrics_path = session_dir / "lyrics.txt" |
|
|
|
with open(genre_path, "wb") as f: |
|
f.write(await genre_file.read()) |
|
with open(lyrics_path, "wb") as f: |
|
f.write(await lyrics_file.read()) |
|
|
|
# Handle optional audio prompt |
|
audio_prompt_path = None |
|
if audio_prompt: |
|
audio_prompt_path = session_dir / "audio_prompt.wav" |
|
with open(audio_prompt_path, "wb") as f: |
|
f.write(await audio_prompt.read()) |
|
|
|
# Run inference |
|
try: |
|
# Import your inference code here |
|
from infer import run_inference |
|
output_path = run_inference( |
|
args, |
|
str(genre_path), |
|
str(lyrics_path), |
|
str(audio_prompt_path) if audio_prompt_path else None, |
|
prompt_start_time, |
|
prompt_end_time |
|
) |
|
|
|
return FileResponse( |
|
output_path, |
|
media_type="audio/mpeg", |
|
filename=f"generated_music_{session_id}.mp3" |
|
) |
|
except Exception as e: |
|
return {"error": str(e)} |
|
|
|
if __name__ == "__main__": |
|
uvicorn.run(app, host="0.0.0.0", port=8000) |
|
``` |
|
|
|
2. Create a new file `infer.py` with your existing inference code, modified to be imported as a module. |
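The API above imports `run_inference` from that module, so `infer.py` needs to expose a matching entry point. The skeleton below is only a sketch of the expected signature; the body is a placeholder for your existing Stage 1 and Stage 2 inference code.

```python
# infer.py -- sketch of the interface api.py expects (implementation is your existing code)
from pathlib import Path
from typing import Optional


def run_inference(
    args,
    genre_path: str,
    lyrics_path: str,
    audio_prompt_path: Optional[str] = None,
    prompt_start_time: float = 0.0,
    prompt_end_time: float = 30.0,
) -> str:
    """Generate a track and return the path to the rendered audio file."""
    genre_tags = Path(genre_path).read_text().strip()
    lyrics = Path(lyrics_path).read_text().strip()

    # 1. Stage 1: generate the coarse token sequence from genre tags + lyrics,
    #    optionally conditioned on the audio prompt between the given timestamps.
    # 2. Stage 2: refine the tokens and decode them to audio with xcodec_mini_infer.
    # 3. Write the result under args.output_dir and return its path.
    output_path = Path(args.output_dir) / "generated.mp3"
    ...
    return str(output_path)
```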
|
|
|
## Running the API |
|
|
|
1. Start the API server: |
|
```bash |
|
python api.py |
|
``` |
|
|
|
2. The API will be available at `http://localhost:8000` |
|
|
|
## API Endpoints |
|
|
|
### POST /generate |
|
Generates music based on provided genre and lyrics. |
|
|
|
**Parameters:** |
|
- `genre_file`: Text file containing genre tags (Required) |
|
- `lyrics_file`: Text file containing lyrics (Required) |
|
- `audio_prompt`: Audio file for prompt (Optional) |
|
- `prompt_start_time`: Start time of the audio prompt in seconds (Default: 0.0)

- `prompt_end_time`: End time of the audio prompt in seconds (Default: 30.0)
|
|
|
**Example using curl:** |
|
```bash |
|
curl -X POST "http://localhost:8000/generate" \ |
|
-H "accept: application/json" \ |
|
-H "Content-Type: multipart/form-data" \ |
|
-F "genre_file=@/path/to/genre.txt" \ |
|
-F "lyrics_file=@/path/to/lyrics.txt" \ |
|
-F "prompt_start_time=0.0" \ |
|
-F "prompt_end_time=30.0" |
|
``` |
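The same request can be made from Python; the snippet below is a minimal client sketch using the `requests` library (installed separately) against the endpoint defined above.

```python
# Minimal Python client for the /generate endpoint (requires `pip install requests`)
import requests

with open("genre.txt", "rb") as genre, open("lyrics.txt", "rb") as lyrics:
    response = requests.post(
        "http://localhost:8000/generate",
        files={"genre_file": genre, "lyrics_file": lyrics},
        data={"prompt_start_time": 0.0, "prompt_end_time": 30.0},
        timeout=3600,  # generation can take several minutes
    )

response.raise_for_status()
with open("generated_music.mp3", "wb") as out:
    out.write(response.content)
```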
|
|
|
**Example genre.txt format:** |
|
``` |
|
instrumental pop energetic female vocals |
|
``` |
|
|
|
**Example lyrics.txt format:** |
|
``` |
|
[verse] |
|
Your lyrics here |
|
[chorus] |
|
Your chorus here |
|
``` |
|
|
|
## H100 Optimization |
|
|
|
1. Enable Flash Attention: |
|
```python |
|
from transformers import AutoModelForCausalLM
import torch

# Load the Stage 1 model in bfloat16 with Flash Attention 2 enabled
model = AutoModelForCausalLM.from_pretrained(
    args.stage1_model,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
|
``` |
|
|
|
2. Optimize memory usage: |
|
```python |
|
# Add to your inference configuration |
|
torch.cuda.set_device(0) # Use first H100 |
|
torch.backends.cudnn.benchmark = True |
|
``` |
|
|
|
3. For multi-GPU setup, modify `cuda_idx` in the API configuration. |
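For example, the device index can be wired through `args.cuda_idx` from `init_models()`; the snippet below is a sketch of that wiring. The TF32 settings are an optional extra that Hopper GPUs handle well, not something the original configuration sets.

```python
# Select the GPU from the API configuration (assumes `args` from init_models())
import torch

device = torch.device(f"cuda:{args.cuda_idx}")
torch.cuda.set_device(device)

# Optional: allow TF32 matmuls on H100 (assumption, not part of the original config)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```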
|
|
|
## Monitoring |
|
|
|
The API includes Swagger documentation at `http://localhost:8000/docs` for testing and monitoring endpoints. |
|
|
|
## Troubleshooting |
|
|
|
1. CUDA Out of Memory: |
|
- Reduce `stage2_batch_size` |
|
- Adjust `max_new_tokens` |
|
- Use gradient checkpointing (see the settings sketch after this list)
|
|
|
2. Audio Quality Issues: |
|
- Check input audio format (16kHz, mono) |
|
- Verify genre tags format |
|
- Ensure lyrics follow the correct structure |
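A settings sketch for the out-of-memory case, assuming the `args` object from `init_models()` and the `model` loaded in the Flash Attention snippet above; the values are illustrative, not tuned recommendations.

```python
# Illustrative memory-saving adjustments to the API configuration
args.stage2_batch_size = 2     # down from the default of 4
args.max_new_tokens = 2000     # shorter generations use less KV-cache memory

# Gradient checkpointing trades compute for memory on Hugging Face models
# (only relevant when fine-tuning; it has no effect on pure inference)
model.gradient_checkpointing_enable()
```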
|
|
|
## Training |
|
|
|
This model was created through a multi-stage training process optimized for music generation. You can further fine-tune the model on your own data using the following steps: |
|
|
|
### Data Preparation |
|
|
|
1. Prepare your training data using the provided script: |
|
```bash |
|
python prepare_training_data.py |
|
``` |
|
|
|
The script expects the following directory structure: |
|
``` |
|
training_data/ |
|
├── audio_tracks/   # 16kHz mono WAV files

├── lyrics/         # Corresponding lyrics files

└── genres/         # Genre tag files
|
``` |
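The script expects 16 kHz mono WAV audio; if your source material is in another format, a small conversion helper like the sketch below (using `torchaudio`, installed in step 2 of the installation) can bring it into shape. The file paths are illustrative.

```python
# Convert arbitrary audio files to 16 kHz mono WAV using torchaudio
import torchaudio

TARGET_SR = 16000

def to_16k_mono(src_path: str, dst_path: str) -> None:
    waveform, sr = torchaudio.load(src_path)          # (channels, frames)
    waveform = waveform.mean(dim=0, keepdim=True)     # downmix to mono
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    torchaudio.save(dst_path, waveform, TARGET_SR)

to_16k_mono("raw/song.flac", "training_data/audio_tracks/song.wav")
```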
|
|
|
### Training Requirements |
|
|
|
- NVIDIA H100 GPU (recommended) |
|
- 32GB+ GPU memory |
|
- Training dataset with: |
|
- High-quality audio files (16kHz mono) |
|
- Aligned lyrics in structured format |
|
- Genre annotations |
|
- At least 10,000 samples recommended |
|
|
|
### Fine-tuning Steps |
|
|
|
1. Install additional training dependencies: |
|
```bash |
|
pip install accelerate datasets transformers |
|
``` |
|
|
|
2. Prepare your configuration: |
|
```bash |
|
# For Stage 1 model (7B) |
|
export MODEL_PATH="Nathan9/ScrapeGoatMusic-s1-7B-anneal-en-cot" |
|
export OUTPUT_DIR="./fine_tuned_model_s1" |
|
|
|
# For Stage 2 model (1B) |
|
export MODEL_PATH="Nathan9/ScrapeGoatMusic-s2-1B-general" |
|
export OUTPUT_DIR="./fine_tuned_model_s2" |
|
``` |
|
|
|
3. Start training: |
|
```bash |
|
python train.py \ |
|
--model_name_or_path $MODEL_PATH \ |
|
--output_dir $OUTPUT_DIR \ |
|
--num_train_epochs 3 \ |
|
--per_device_train_batch_size 4 \ |
|
--gradient_accumulation_steps 4 \ |
|
--learning_rate 1e-5 \ |
|
--warmup_steps 500 \ |
|
--logging_steps 100 \ |
|
--save_steps 1000 \ |
|
--evaluation_strategy steps \ |
|
--load_best_model_at_end \ |
|
--gradient_checkpointing true |
|
``` |
|
|
|
### Training Tips |
|
|
|
1. Stage 1 Model: |
|
- Use larger batch sizes (8-16) for better convergence |
|
- Enable gradient checkpointing for memory efficiency |
|
- Start with a lower learning rate (1e-5) |
|
- Train for at least 3 epochs |
|
|
|
2. Stage 2 Model: |
|
- Use smaller batch sizes (4-8) |
|
- Higher learning rate possible (2e-5) |
|
- Shorter training time needed |
|
- Focus on audio quality metrics |
|
|
|
3. Monitoring: |
|
- Use Weights & Biases for training visualization |
|
- Monitor loss curves for convergence |
|
- Validate generation quality periodically |
|
- Check for overfitting on the validation set
|
|
|
4. Performance Optimization (a combined launch sketch follows this list):
|
- Enable Flash Attention during training |
|
- Use mixed precision training (bf16) |
|
- Distribute training across multiple GPUs if available |
|
- Implement proper gradient clipping |
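As a sketch of how these tips combine, the command below extends the `train.py` invocation from the fine-tuning steps with an `accelerate` multi-GPU launch, bf16 mixed precision, and gradient clipping. Flags beyond those already shown above assume `train.py` uses standard Hugging Face `TrainingArguments`.

```bash
# Multi-GPU launch with mixed precision, gradient clipping, and W&B logging
# (flag names beyond the earlier example are assumptions based on TrainingArguments)
accelerate launch --multi_gpu --mixed_precision bf16 train.py \
    --model_name_or_path $MODEL_PATH \
    --output_dir $OUTPUT_DIR \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-5 \
    --bf16 true \
    --max_grad_norm 1.0 \
    --gradient_checkpointing true \
    --report_to wandb
```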
|
|
|
## License |
|
|
|
Apache 2.0 (see the `license` field in the metadata above). Full access, enjoy.
|
|
|
|