|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
tags: |
|
- music |
|
- text-generation |
|
- transformers |
|
pipeline_tag: text-generation |
|
library_name: transformers |
|
--- |
|
|
|
# ScrapeGoatMusic Generation API (Stage 2 Model)
|
|
|
A music generation system powered by ScrapeGoatMusic, served through a FastAPI interface and optimized for NVIDIA H100 GPUs.
|
|
|
## System Requirements |
|
|
|
- NVIDIA H100 GPU |
|
- CUDA 12.0 or higher |
|
- Python 3.8 |
|
- 32GB+ RAM |
|
- Ubuntu 22.04 LTS or higher |
|
|
|
## Installation |
|
|
|
1. Create and activate a conda environment: |
|
```bash |
|
conda create -n ScrapeGoatMusic python=3.8 |
|
conda activate ScrapeGoatMusic |
|
``` |
|
|
|
2. Install PyTorch with CUDA support: |
|
```bash |
|
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia |
|
``` |
|
|
|
3. Install dependencies: |
|
```bash |
|
pip install descript-audio-codec |
|
pip install npy_append_array soundfile |
|
pip install fastapi uvicorn python-multipart |
|
pip install flash-attn --no-build-isolation |
|
``` |
|
|
|
4. Download the required model files (the RepCodec step below expects the `xcodec_mini_infer` directory this creates):

```bash
# Download models from Hugging Face (run from the repository root)
git lfs install
cd inference
git clone https://huggingface.co/Nathan9/xcodec_mini_infer
```
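If `git lfs` is slow or unavailable, the same files can usually be fetched with the `huggingface_hub` Python client instead; the snippet below is a minimal sketch and assumes the `Nathan9/xcodec_mini_infer` repository listed above (`huggingface_hub` is an extra dependency not installed by the steps above).

```python
# Alternative download via huggingface_hub (assumes the repo id shown above)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Nathan9/xcodec_mini_infer",
    local_dir="inference/xcodec_mini_infer",  # same location the git clone would create
)
```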
|
|
|
5. Clone and install RepCodec:

```bash
# Run from the repository root
cd inference/xcodec_mini_infer
git clone https://github.com/mct10/RepCodec.git
cd RepCodec
pip install .
```
|
|
|
## API Setup |
|
|
|
1. Create a new file `api.py`: |
|
```python |
|
from fastapi import FastAPI, UploadFile, File, Form |
|
from fastapi.responses import FileResponse |
|
import uvicorn |
|
import torch |
|
import os |
|
import argparse |
|
from pathlib import Path |
|
import uuid |
|
from typing import Optional |
|
|
|
app = FastAPI(title="ScrapeGoatMusic Generation API") |
|
|
|
# Initialize models and configurations |
|
def init_models(): |
|
parser = argparse.ArgumentParser() |
|
# Add all your existing arguments here |
|
args = parser.parse_args([]) |
|
args.stage1_model = "Nathan9/ScrapeGoatMusic-s1-7B-anneal-en-cot" |
|
args.stage2_model = "Nathan9/ScrapeGoatMusic-s2-1B-general" |
|
args.max_new_tokens = 3000 |
|
args.run_n_segments = 2 |
|
args.stage2_batch_size = 4 |
|
args.output_dir = "./output" |
|
args.cuda_idx = 0 |
|
# Add other default arguments |
|
return args |
|
|
|
@app.on_event("startup") |
|
async def startup_event(): |
|
global args |
|
args = init_models() |
|
os.makedirs(args.output_dir, exist_ok=True) |
|
|
|
@app.post("/generate") |
|
async def generate_music( |
|
genre_file: UploadFile = File(...), |
|
lyrics_file: UploadFile = File(...), |
|
audio_prompt: Optional[UploadFile] = File(None), |
|
prompt_start_time: float = Form(0.0), |
|
prompt_end_time: float = Form(30.0) |
|
): |
|
# Create unique session ID |
|
session_id = str(uuid.uuid4()) |
|
session_dir = Path(args.output_dir) / session_id |
|
os.makedirs(session_dir, exist_ok=True) |
|
|
|
# Save uploaded files |
|
genre_path = session_dir / "genre.txt" |
|
lyrics_path = session_dir / "lyrics.txt" |
|
|
|
with open(genre_path, "wb") as f: |
|
f.write(await genre_file.read()) |
|
with open(lyrics_path, "wb") as f: |
|
f.write(await lyrics_file.read()) |
|
|
|
# Handle optional audio prompt |
|
audio_prompt_path = None |
|
if audio_prompt: |
|
audio_prompt_path = session_dir / "audio_prompt.wav" |
|
with open(audio_prompt_path, "wb") as f: |
|
f.write(await audio_prompt.read()) |
|
|
|
# Run inference |
|
try: |
|
# Import your inference code here |
|
from infer import run_inference |
|
output_path = run_inference( |
|
args, |
|
str(genre_path), |
|
str(lyrics_path), |
|
str(audio_prompt_path) if audio_prompt_path else None, |
|
prompt_start_time, |
|
prompt_end_time |
|
) |
|
|
|
return FileResponse( |
|
output_path, |
|
media_type="audio/mpeg", |
|
filename=f"generated_music_{session_id}.mp3" |
|
) |
|
except Exception as e: |
|
return {"error": str(e)} |
|
|
|
if __name__ == "__main__": |
|
uvicorn.run(app, host="0.0.0.0", port=8000) |
|
``` |
|
|
|
2. Create a new file `infer.py` with your existing inference code, modified to be imported as a module. |
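The API above imports `run_inference` from that module, so `infer.py` needs to expose a matching entry point. The skeleton below is only a sketch of the expected signature; the body is a placeholder for your existing Stage 1 and Stage 2 inference code.

```python
# infer.py -- sketch of the interface api.py expects (implementation is your existing code)
from pathlib import Path
from typing import Optional


def run_inference(
    args,
    genre_path: str,
    lyrics_path: str,
    audio_prompt_path: Optional[str] = None,
    prompt_start_time: float = 0.0,
    prompt_end_time: float = 30.0,
) -> str:
    """Generate a track and return the path to the rendered audio file."""
    genre_tags = Path(genre_path).read_text().strip()
    lyrics = Path(lyrics_path).read_text().strip()

    # 1. Stage 1: generate the coarse token sequence from genre tags + lyrics,
    #    optionally conditioned on the audio prompt between the given timestamps.
    # 2. Stage 2: refine the tokens and decode them to audio with xcodec_mini_infer.
    # 3. Write the result under args.output_dir and return its path.
    output_path = Path(args.output_dir) / "generated.mp3"
    ...
    return str(output_path)
```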
|
|
|
## Running the API |
|
|
|
1. Start the API server: |
|
```bash |
|
python api.py |
|
``` |
|
|
|
2. The API will be available at `http://localhost:8000` |
|
|
|
## API Endpoints |
|
|
|
### POST /generate |
|
Generates music based on provided genre and lyrics. |
|
|
|
**Parameters:** |
|
- `genre_file`: Text file containing genre tags (Required) |
|
- `lyrics_file`: Text file containing lyrics (Required) |
|
- `audio_prompt`: Audio file for prompt (Optional) |
|
- `prompt_start_time`: Start time of the audio prompt in seconds (Default: 0.0)

- `prompt_end_time`: End time of the audio prompt in seconds (Default: 30.0)
|
|
|
**Example using curl:** |
|
```bash |
|
curl -X POST "http://localhost:8000/generate" \ |
|
-H "accept: application/json" \ |
|
-H "Content-Type: multipart/form-data" \ |
|
-F "genre_file=@/path/to/genre.txt" \ |
|
-F "lyrics_file=@/path/to/lyrics.txt" \ |
|
-F "prompt_start_time=0.0" \ |
|
-F "prompt_end_time=30.0" |
|
``` |
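The same request can be made from Python; the snippet below is a minimal client sketch using the `requests` library (installed separately) against the endpoint defined above.

```python
# Minimal Python client for the /generate endpoint (requires `pip install requests`)
import requests

with open("genre.txt", "rb") as genre, open("lyrics.txt", "rb") as lyrics:
    response = requests.post(
        "http://localhost:8000/generate",
        files={"genre_file": genre, "lyrics_file": lyrics},
        data={"prompt_start_time": 0.0, "prompt_end_time": 30.0},
        timeout=3600,  # generation can take several minutes
    )

response.raise_for_status()
with open("generated_music.mp3", "wb") as out:
    out.write(response.content)
```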
|
|
|
**Example genre.txt format:** |
|
``` |
|
instrumental pop energetic female vocals |
|
``` |
|
|
|
**Example lyrics.txt format:** |
|
``` |
|
[verse] |
|
Your lyrics here |
|
[chorus] |
|
Your chorus here |
|
``` |
|
|
|
## H100 Optimization |
|
|
|
1. Enable Flash Attention: |
|
```python |
|
from transformers import AutoModelForCausalLM
import torch

# Load the Stage 1 model in bfloat16 with Flash Attention 2 enabled
model = AutoModelForCausalLM.from_pretrained(
    args.stage1_model,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
|
``` |
|
|
|
2. Optimize memory usage: |
|
```python |
|
# Add to your inference configuration |
|
torch.cuda.set_device(0) # Use first H100 |
|
torch.backends.cudnn.benchmark = True |
|
``` |
|
|
|
3. For multi-GPU setup, modify `cuda_idx` in the API configuration. |
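For example, the device index can be wired through `args.cuda_idx` from `init_models()`; the snippet below is a sketch of that wiring. The TF32 settings are an optional extra that Hopper GPUs handle well, not something the original configuration sets.

```python
# Select the GPU from the API configuration (assumes `args` from init_models())
import torch

device = torch.device(f"cuda:{args.cuda_idx}")
torch.cuda.set_device(device)

# Optional: allow TF32 matmuls on H100 (assumption, not part of the original config)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```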
|
|
|
## Monitoring |
|
|
|
The API includes Swagger documentation at `http://localhost:8000/docs` for testing and monitoring endpoints. |
|
|
|
## Troubleshooting |
|
|
|
1. CUDA Out of Memory: |
|
- Reduce `stage2_batch_size` |
|
- Adjust `max_new_tokens` |
|
- Use gradient checkpointing (see the settings sketch after this list)
|
|
|
2. Audio Quality Issues: |
|
- Check input audio format (16kHz, mono) |
|
- Verify genre tags format |
|
- Ensure lyrics follow the correct structure |
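A settings sketch for the out-of-memory case, assuming the `args` object from `init_models()` and the `model` loaded in the Flash Attention snippet above; the values are illustrative, not tuned recommendations.

```python
# Illustrative memory-saving adjustments to the API configuration
args.stage2_batch_size = 2     # down from the default of 4
args.max_new_tokens = 2000     # shorter generations use less KV-cache memory

# Gradient checkpointing trades compute for memory on Hugging Face models
# (only relevant when fine-tuning; it has no effect on pure inference)
model.gradient_checkpointing_enable()
```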
|
|
|
## Training |
|
|
|
This model was created through a multi-stage training process optimized for music generation. You can further fine-tune the model on your own data using the following steps: |
|
|
|
### Data Preparation |
|
|
|
1. Prepare your training data using the provided script: |
|
```bash |
|
python prepare_training_data.py |
|
``` |
|
|
|
The script expects the following directory structure: |
|
``` |
|
training_data/ |
|
├── audio_tracks/   # 16kHz mono WAV files

├── lyrics/         # Corresponding lyrics files

└── genres/         # Genre tag files
|
``` |
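The script expects 16 kHz mono WAV audio; if your source material is in another format, a small conversion helper like the sketch below (using `torchaudio`, installed in step 2 of the installation) can bring it into shape. The file paths are illustrative.

```python
# Convert arbitrary audio files to 16 kHz mono WAV using torchaudio
import torchaudio

TARGET_SR = 16000

def to_16k_mono(src_path: str, dst_path: str) -> None:
    waveform, sr = torchaudio.load(src_path)          # (channels, frames)
    waveform = waveform.mean(dim=0, keepdim=True)     # downmix to mono
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    torchaudio.save(dst_path, waveform, TARGET_SR)

to_16k_mono("raw/song.flac", "training_data/audio_tracks/song.wav")
```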
|
|
|
### Training Requirements |
|
|
|
- NVIDIA H100 GPU (recommended) |
|
- 32GB+ GPU memory |
|
- Training dataset with: |
|
- High-quality audio files (16kHz mono) |
|
- Aligned lyrics in structured format |
|
- Genre annotations |
|
- At least 10,000 samples recommended |
|
|
|
### Fine-tuning Steps |
|
|
|
1. Install additional training dependencies: |
|
```bash |
|
pip install accelerate datasets transformers |
|
``` |
|
|
|
2. Prepare your configuration: |
|
```bash |
|
# For Stage 1 model (7B) |
|
export MODEL_PATH="Nathan9/ScrapeGoatMusic-s1-7B-anneal-en-cot" |
|
export OUTPUT_DIR="./fine_tuned_model_s1" |
|
|
|
# For Stage 2 model (1B) |
|
export MODEL_PATH="Nathan9/ScrapeGoatMusic-s2-1B-general" |
|
export OUTPUT_DIR="./fine_tuned_model_s2" |
|
``` |
|
|
|
3. Start training: |
|
```bash |
|
python train.py \ |
|
--model_name_or_path $MODEL_PATH \ |
|
--output_dir $OUTPUT_DIR \ |
|
--num_train_epochs 3 \ |
|
--per_device_train_batch_size 4 \ |
|
--gradient_accumulation_steps 4 \ |
|
--learning_rate 1e-5 \ |
|
--warmup_steps 500 \ |
|
--logging_steps 100 \ |
|
--save_steps 1000 \ |
|
--evaluation_strategy steps \ |
|
--load_best_model_at_end \ |
|
--gradient_checkpointing true |
|
``` |
|
|
|
### Training Tips |
|
|
|
1. Stage 1 Model: |
|
- Use larger batch sizes (8-16) for better convergence |
|
- Enable gradient checkpointing for memory efficiency |
|
- Start with a lower learning rate (1e-5) |
|
- Train for at least 3 epochs |
|
|
|
2. Stage 2 Model: |
|
- Use smaller batch sizes (4-8) |
|
- Higher learning rate possible (2e-5) |
|
- Shorter training time needed |
|
- Focus on audio quality metrics |
|
|
|
3. Monitoring: |
|
- Use Weights & Biases for training visualization |
|
- Monitor loss curves for convergence |
|
- Validate generation quality periodically |
|
- Check for overfitting on the validation set
|
|
|
4. Performance Optimization (a combined launch sketch follows this list):
|
- Enable Flash Attention during training |
|
- Use mixed precision training (bf16) |
|
- Distribute training across multiple GPUs if available |
|
- Implement proper gradient clipping |
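As a sketch of how these tips combine, the command below extends the `train.py` invocation from the fine-tuning steps with an `accelerate` multi-GPU launch, bf16 mixed precision, and gradient clipping. Flags beyond those already shown above assume `train.py` uses standard Hugging Face `TrainingArguments`.

```bash
# Multi-GPU launch with mixed precision, gradient clipping, and W&B logging
# (flag names beyond the earlier example are assumptions based on TrainingArguments)
accelerate launch --multi_gpu --mixed_precision bf16 train.py \
    --model_name_or_path $MODEL_PATH \
    --output_dir $OUTPUT_DIR \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-5 \
    --bf16 true \
    --max_grad_norm 1.0 \
    --gradient_checkpointing true \
    --report_to wandb
```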
|
|
|
## License |
|
|
|
Apache 2.0 (see the `license` field in the metadata above). Full access, enjoy.
|
|
|
|