Synthetic Data Generation (TUM Data Innovation Lab with PwC)
last year
completed · private
Academic project developed for the TUM Data Innovation Lab (Summer 2024) in collaboration with PwC. Implements synthetic data generation using two approaches: LLM-based GenAI methods (GPT-4, GPT-3.5, Gemini) and statistical methods (copulas, Bayesian networks, parametric models). Features a Streamlit UI for uploading statistical information and correlation matrices, with support for multiple generation backends and evaluation metrics.
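As an illustration of the statistical route, the sketch below draws correlated samples from user-supplied marginals and a correlation matrix with a Gaussian copula. It is a minimal example under stated assumptions, not the project's code: the function name `sample_gaussian_copula`, the choice of a Gaussian copula, and the example marginals are all hypothetical.

```python
import numpy as np
from scipy import stats


def sample_gaussian_copula(marginals, corr, n_samples, seed=0):
    """Draw rows whose columns follow `marginals` (frozen SciPy distributions)
    and whose dependence follows a Gaussian copula with correlation `corr`.

    Hypothetical helper for illustration; not taken from the project.
    """
    rng = np.random.default_rng(seed)
    corr = np.asarray(corr, dtype=float)
    # Step 1: correlated standard normals with the requested correlation.
    z = rng.multivariate_normal(np.zeros(len(marginals)), corr, size=n_samples)
    # Step 2: map each column to Uniform(0, 1) through the normal CDF.
    u = stats.norm.cdf(z)
    # Step 3: push the uniforms through each marginal's inverse CDF.
    return np.column_stack([m.ppf(u[:, j]) for j, m in enumerate(marginals)])


# Assumed example inputs: one normal and one exponential column.
marginals = [stats.norm(loc=50, scale=10), stats.expon(scale=2.0)]
corr = [[1.0, 0.6], [0.6, 1.0]]
synthetic = sample_gaussian_copula(marginals, corr, n_samples=5000)
```

Note that `corr` here parameterizes the latent normals, so the Pearson correlation of the transformed columns can differ from it when the marginals are not normal; rank correlation is the more stable quantity to compare.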
// meta
role
Testing & Evaluation Lead
status
completed
collaborators
5
// highlights
>Implemented multiple synthetic data generation backends: LLM-based (GPT-4, GPT-3.5, Gemini, TinyLlama) and statistical (copula and parametric methods)
>Built interactive Streamlit UI for statistical input upload and synthetic data generation with correlation matrix support
>Integrated DSPy compiled programs for optimized LLM-based generation with few-shot learning
>Developed evaluation framework comparing generated data statistics against input distributions and correlation preservation
>Implemented C-vine pair-copula construction for statistical data generation following academic methodology
>Created experiment tracking with Weights & Biases for benchmarking model performance across sample sizes
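The evaluation idea in the highlights above (comparing generated statistics against the inputs and checking correlation preservation) could be sketched roughly as follows. This is an assumed, simplified version: the function `evaluate_synthetic`, the metric names, and the demo data are hypothetical and do not reproduce the project's actual metrics.

```python
import numpy as np
import pandas as pd


def evaluate_synthetic(df, target_means, target_stds, target_corr):
    """Compare a generated DataFrame with target statistics.

    Hypothetical helper: `target_means` and `target_stds` map column name
    to the requested value; `target_corr` is ordered like those columns.
    """
    cols = list(target_means)
    mean_err = {c: abs(df[c].mean() - target_means[c]) for c in cols}
    std_err = {c: abs(df[c].std(ddof=1) - target_stds[c]) for c in cols}
    # Element-wise gap between empirical and requested correlation matrices.
    corr_diff = df[cols].corr().to_numpy() - np.asarray(target_corr, dtype=float)
    return {
        "mean_abs_error": mean_err,
        "std_abs_error": std_err,
        "corr_frobenius": float(np.linalg.norm(corr_diff, ord="fro")),
        "corr_max_abs": float(np.abs(corr_diff).max()),
    }


# Demo with assumed targets: two standard-normal columns, correlation 0.5.
rng = np.random.default_rng(0)
target_corr = [[1.0, 0.5], [0.5, 1.0]]
data = rng.multivariate_normal([0.0, 0.0], target_corr, size=10000)
demo = pd.DataFrame(data, columns=["a", "b"])
report = evaluate_synthetic(
    demo, {"a": 0.0, "b": 0.0}, {"a": 1.0, "b": 1.0}, target_corr
)
```

A dictionary like `report` could be computed per backend and per sample size and then logged to an experiment tracker for benchmarking.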
// stack
Python, DSPy, OpenAI API, Streamlit, Weights & Biases, Transformers, PyTorch, NumPy, Pandas, Google Generative AI, SciPy