What are LLMs?
Architecture
https://youtu.be/7xTGNNLPyMI?si=vnpna0PUnb0eVkPe
https://alphasignalai.beehiiv.com/p/understanding-llms-0-1
https://www.youtube.com/watch?v=zjkBMFhNj_g
https://github.com/mlabonne/llm-course
GPT - Generative Pre-trained Transformers
https://github.com/karpathy/minGPT
https://pytorch.org/blog/accelerating-generative-ai-2/ https://github.com/pytorch-labs/gpt-fast
Transformer architecture
https://www.3blue1brown.com/lessons/attention
https://jalammar.github.io/illustrated-transformer/
https://jalammar.github.io/illustrated-gpt2/
https://towardsdatascience.com/drawing-the-transformer-network-from-scratch-part-1-9269ed9a2c5e
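To make the linked explainers concrete, here is a minimal single-head scaled dot-product attention in NumPy; the shapes, seed, and dimensions are illustrative only.
```python
# Minimal single-head scaled dot-product attention, the core operation the
# Transformer articles above illustrate. Shapes and names are illustrative.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) projections of the input embeddings
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # similarity of each query to each key
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted mix of value vectors

seq_len, d_k = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```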
Internals
https://www.anthropic.com/research/tracing-thoughts-language-model
https://transformer-circuits.pub/2025/attribution-graphs/methods.html
https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Reasoning LLMs
https://magazine.sebastianraschka.com/p/understanding-reasoning-llms
Formation of a world model
https://thegradient.pub/othello/
Hands-on - Programmatic use of LLMs
https://building-with-llms-pycon-2025.readthedocs.io/en/latest/
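A minimal sketch of the programmatic pattern the PyCon tutorial covers, using the OpenAI Python SDK; the model name is an illustrative assumption.
```python
# Minimal programmatic LLM call with the OpenAI Python SDK (pip install openai).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; swap in whatever your provider offers
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize what a transformer is in one sentence."},
    ],
)
print(response.choices[0].message.content)
```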
Hands-on - LLM from scratch
Llama 3
https://github.com/naklecha/llama3-from-scratch
Llama 2
https://magazine.sebastianraschka.com/p/building-llms-from-the-ground-up
GPT-2 with WebGL
https://nathan.rs/posts/gpu-shader-programming/
Small GPT with PyTorch
https://github.com/Om-Alve/smolGPT
Generic LLM
https://github.com/rasbt/LLMs-from-scratch
https://github.com/joennlae/tensorli
https://vgel.me/posts/handmade-transformer/
https://vgel.me/posts/faster-inference/
https://github.com/tinygrad/tinygrad
https://docs.tinygrad.org/developer/
With plain C + CUDA
https://github.com/karpathy/llm.c
Mixture of Experts from scratch
https://github.com/AviSoori1x/makeMoE
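A toy sketch of the routing idea makeMoE builds up: a learned router scores the experts with a softmax, and each token is processed only by its top-k experts. Dimensions and names are illustrative, not makeMoE's actual code.
```python
# Toy Mixture-of-Experts layer: router picks top-k experts per token and
# mixes their outputs by the routing weights. Sketch only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=16, n_experts=4, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                            # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)    # (tokens, n_experts)
        topv, topi = gates.topk(self.k, dim=-1)      # keep the k best experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e            # tokens whose slot routes to expert e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(8, 16)
print(TinyMoE()(x).shape)  # torch.Size([8, 16])
```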
Build with LLMs
https://eugeneyan.com/writing/llm-patterns
- Evals: To measure performance
- RAG: To add recent, external knowledge
- Fine-tuning: To get better at specific tasks
- Caching: To reduce latency & cost (see the sketch after this list)
- Guardrails: To ensure output quality
- Defensive UX: To anticipate & manage errors gracefully
- Collect user feedback: To build our data flywheel
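A minimal sketch of the caching pattern from the list above, assuming exact-match caching keyed on a hash of the request; production systems often add semantic (nearest-neighbour) matching on top.
```python
# Naive response cache: identical requests skip the API call entirely.
import hashlib
import json

_cache: dict[str, str] = {}

def cached_complete(client, model: str, messages: list[dict]) -> str:
    # Key on a stable hash of the full request.
    key = hashlib.sha256(
        json.dumps([model, messages], sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        resp = client.chat.completions.create(model=model, messages=messages)
        _cache[key] = resp.choices[0].message.content
    return _cache[key]
```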
Retrieval-Augmented Generation (RAG)
https://jxnl.co/writing/2024/05/22/systematically-improving-your-rag/#start-with-synthetic-data
RAG setups
https://github.com/pingcap/autoflow
- Graph-based knowledge base RAG
- TiDB Vector
- LlamaIndex
Rlama - RAG + local models
DeepRAG
https://arxiv.org/abs/2502.01142
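To ground the pattern all of these setups share, a bare-bones retrieve-then-generate sketch: embed the documents once, pull the nearest chunk for each question, and prepend it to the prompt. Model names and documents are illustrative.
```python
# Minimal RAG loop with the OpenAI Python SDK. Sketch only: real systems
# chunk documents and use a vector store instead of an in-memory array.
import numpy as np
from openai import OpenAI

client = OpenAI()
docs = [
    "TiDB Vector stores embeddings inside TiDB.",
    "LlamaIndex provides data connectors and query engines.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(docs)  # embed the corpus once, up front

def answer(question: str) -> str:
    q = embed([question])[0]
    best = docs[int(np.argmax(doc_vecs @ q))]  # OpenAI embeddings are unit-length, so dot = cosine
    prompt = f"Answer using this context:\n{best}\n\nQuestion: {question}"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(answer("Where does TiDB Vector keep its embeddings?"))
```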
Structured output (JSON schema)
https://openai.com/index/introducing-structured-outputs-in-the-api/
https://github.com/anthropics/anthropic-cookbook/blob/main/tool_use/extracting_structured_json.ipynb
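A minimal sketch of OpenAI's JSON-schema mode from the announcement above; the schema, prompt, and model name are illustrative.
```python
# Structured output: the response is constrained to match a JSON schema.
from openai import OpenAI

client = OpenAI()
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["name", "year"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content":
               "When was the transformer paper published, and what was it called?"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "paper", "schema": schema, "strict": True},
    },
)
print(resp.choices[0].message.content)  # valid JSON matching the schema
```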
Model performance
https://artificialanalysis.ai/
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
LLM application stack
https://github.com/a16z-infra/llm-app-stack
Streaming UI kits
https://sdk.vercel.ai/docs https://sdk.vercel.ai/docs/guides/providers/openai
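The Vercel AI SDK is TypeScript; the underlying idea is token-by-token streaming. The same server-side pattern sketched with the OpenAI Python SDK (model name illustrative):
```python
# Stream completion tokens as they arrive instead of waiting for the full reply.
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain streaming UIs in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)  # render incrementally
print()
```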
Running LLMs locally
https://abishekmuthian.com/how-i-run-llms-locally/
- Ollama
- Continue
In the browser
https://github.com/abi/secret-llama
Ollama
https://mobiarch.wordpress.com/2024/02/19/run-rag-locally-using-mistral-ollama-and-langchain/
- Fine-tuning
- RAG
- Mistral
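A minimal sketch of talking to a local Ollama server from Python, assuming the `ollama` package is installed and `ollama pull mistral` has been run; any pulled model works.
```python
# Chat against a local Ollama server (pip install ollama).
import ollama

resp = ollama.chat(
    model="mistral",  # must be pulled locally first: `ollama pull mistral`
    messages=[{"role": "user", "content": "What is RAG, in one sentence?"}],
)
print(resp["message"]["content"])
```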
OpenWebUI
https://github.com/open-webui/open-webui
Ollama in the cloud
https://fly.io/blog/scaling-llm-ollama/
Tools
LlamaIndex
LangSmith
Instructor
Lessons learned building products on LLMs
https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/
Applications / recipes
AI-native apps instead of traditional apps with AI mixed in
https://koomen.dev/essays/horseless-carriages/
Intro
https://microsoft.github.io/generative-ai-for-beginners/#/
Rewriting codebases
https://blog.withmantle.com/code-conversion-using-ai/
- Using Gemini with a 1M-token context
phidata - framework for AI assistants
https://github.com/phidatahq/phidata
Investment research
https://github.com/phidatahq/phidata/tree/main/cookbook/llms/groq/investment_researcher
Completely self-hosted “Chat with your Docs” RAG application
https://lightning.ai/lightning-ai/studios/compare-llama-3-and-phi-3-using-rag
- LlamaIndex
- Ollama
- Llama-3 and Phi-3 models
- Streamlit UI
Llama 3 with a 1M-token context window
https://ollama.com/library/llama3-gradient
Voice chat
https://modal.com/docs/examples/llm-voice-chat
- Whisper speech-to-text: https://github.com/openai/whisper (see the sketch after this list)
- Zephyr for responses: https://arxiv.org/abs/2310.16944
- Tortoise TTS: https://github.com/metavoicexyz/tortoise-tts
- Deployed on Modal
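A sketch of the speech-to-text step using openai-whisper, per the Whisper bullet above; the audio file name and model size are illustrative.
```python
# Transcribe audio locally with openai-whisper (pip install openai-whisper).
import whisper

model = whisper.load_model("base")          # sizes: tiny/base/small/medium/large
result = model.transcribe("question.mp3")   # illustrative file name
print(result["text"])                       # transcript to feed the response model
```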
AI therapist
https://eugeneyan.com/writing/ai-coach/
- Voice API (vapi.com)
Building GPTs
What is GPT Builder, intro to creating GPTs: https://twitter.com/rowancheung/status/1721594409274478847?s=20 https://twitter.com/rowancheung/status/1722971638239514869?s=20
Add Actions: https://twitter.com/rowancheung/status/1724436285983469857?s=20
Best practices: https://twitter.com/rowancheung/status/1724783579073572924?s=20
Examples of the best GPTs (as of Nov 2023): https://twitter.com/rowancheung/status/1723379655103885728?s=20
Top 10 favorites (as of Nov 2023): https://twitter.com/rowancheung/status/1723711759242895417?s=20 https://supertools.therundown.ai/
Prompt engineering vs fine-tuning vs RAG
https://myscale.com/blog/prompt-engineering-vs-finetuning-vs-rag/
Prompt engineering
For OpenAI models
For Anthropic models
https://thenameless.net/astral-kit/anthropic-peit-04
Example prompt 1
Be terse. Do not offer unprompted advice or clarifications.
Avoid mentioning you are an AI language model.
Avoid disclaimers about your knowledge cutoff.
Avoid disclaimers about not being a professional or an expert.
Do NOT hedge or qualify. Do not waffle.
Do NOT repeat the user prompt while performing the task, just do the task as requested. NEVER contextualise the answer. This is very important.
Avoid suggesting seeking professional help.
Avoid mentioning safety unless it is not obvious and very important.
Remain neutral on all topics. Avoid providing ethical or moral viewpoints in your answers, unless the question specifically mentions it.
Never apologize.
Act as an expert in the relevant fields.
Speak in specific, topic relevant terminology.
Explain your reasoning. If you don’t know, say you don’t know.
Cite sources whenever possible, and include URLs if possible.
List URLs at the end of your response, not inline.
Speak directly and be willing to make creative guesses.
Be willing to reference less reputable sources for ideas.
Ask for more details before answering unclear or ambiguous questions.
Example prompt 2
“Respond directly to prompts without self-judgment or excessive qualification. Do not use phrases like 'I aim to be', 'I should note', or 'I want to emphasize'.
Skip meta-commentary about your own performance. Maintain intellectual rigor but try to avoid caveats. When uncertainty exists, state it once and move on.
Treat our exchange as a conversation between peers. Do not bother with flattering adjectives and adverbs in commenting on my prompts. No “nuanced”, “insightful” etc. But feel free to make jokes and even poke fun of me and my spelling errors.
Always suggest good texts with full references and even PubMed IDs.
Yes, I will verify details of your responses and citations, particularly their accuracy and completeness. That is not your job. It is mine to check and read.
Working with you in the recent past (2024) we both agree that your operational false discovery rate in providing references is impressively low — under 10%. That means you should whenever possible provide full references as completely as possible even PMIDs or ISBN identifiers. I WILL check.
Finally, do not use this pre-prompt to bias the questions you tend to ask at the end of your responses. Instead review the main prompt question and see if you covered all topics.
End of “pre-prompt”.
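Pre-prompts like the two examples above are normally passed as the system message. A minimal wiring sketch with the OpenAI Python SDK; the excerpt and model name are illustrative.
```python
# Apply a pre-prompt by sending it in the system role.
from openai import OpenAI

PRE_PROMPT = "Be terse. Do not offer unprompted advice. Never apologize."  # excerpt

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": PRE_PROMPT},   # pre-prompt rides along every turn
        {"role": "user", "content": "How do I profile a slow Python function?"},
    ],
)
print(resp.choices[0].message.content)
```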
Advanced prompts
Accuracy for long context
https://www.anthropic.com/index/claude-2-1-prompting
Transfer-learning and fine-tuning
https://dev.to/luxacademy/understanding-the-differences-fine-tuning-vs-transfer-learning-370
Validation data set
https://mlops.systems/posts/2024-06-25-evaluation-finetuning-manual-dataset.html
Limitations of fine-tuning
https://www.latent.space/p/fastai
- catastrophic forgetting
- alternative: continued pre-training
RAG - Retrieval Augmented Generation
https://news.ycombinator.com/item?id=38759877
https://platform.openai.com/docs/assistants/tools/knowledge-retrieval
LLMs on edge devices
https://github.com/facebookresearch/MobileLLM
- quantization
- based on HuggingFace
transformers
https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
- data types
- post-training quantization (PTQ), sketched after this list
- quantization-aware training (QAT)
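A toy version of symmetric per-tensor int8 PTQ, the simplest scheme the visual guide covers: pick one scale, round and clip to int8, dequantize at use time. QAT simulates the same rounding during training so the weights adapt to it.
```python
# Symmetric per-tensor int8 post-training quantization, toy version.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                        # one scale per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())                  # worst-case rounding error
```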
Testing and performance evaluation
https://arxiv.org/abs/2405.14782
Explainability
https://openai.com/index/extracting-concepts-from-gpt-4/
Safety and InfoSec
Extraction of training data
https://arxiv.org/abs/2311.17035
Using an LLM to jailbreak another LLM
https://www.scientificamerican.com/article/jailbroken-ai-chatbots-can-jailbreak-other-chatbots/
Optimization
https://victoria.dev/blog/how-to-send-long-text-input-to-chatgpt-using-the-openai-api/
https://arxiv.org/abs/2405.05417
- detection of untrained and lightly-trained tokens
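One heuristic this line of work relies on is that tokens rarely or never seen in training keep unusually small embedding norms. A hedged sketch with Hugging Face transformers; the model choice and 2-sigma cutoff are illustrative, not the paper's exact method.
```python
# Flag candidate under-trained tokens by their embedding-row norms.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # illustrative model choice
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

emb = model.get_input_embeddings().weight.detach()   # (vocab_size, d_model)
norms = emb.norm(dim=-1)
cutoff = norms.mean() - 2 * norms.std()              # crude outlier threshold
suspect = (norms < cutoff).nonzero().flatten()
print([tok.decode([int(i)]) for i in suspect[:10]])  # candidate glitch tokens
```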
Fine-tuning of open source models
https://arxiv.org/pdf/2405.00732
- specialization
- performance comparable to GPTs
Open models
2025
01.ai Yi
https://github.com/01-ai/Yi-1.5
Meta Llama 3 (2024)
https://github.com/meta-llama/llama3
Mistral 7B
https://www.secondstate.io/articles/mistral-7b-instruct-v0.1/
Fine-tune Mistral models
https://github.com/mistralai/mistral-finetune
- LoRA (toy sketch after this list)
- single-node, multi-GPU
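The LoRA idea in miniature, per the bullet above: freeze the pretrained weight and train only a low-rank update B·A, scaled by alpha/r. A toy sketch, not mistral-finetune's implementation.
```python
# Wrap a frozen nn.Linear with trainable low-rank adapters A and B.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only A and B train: 8*64 + 64*8 = 1024
```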
Meta Llama (2023)
Nvidia Nemotron
Run Locally
https://github.com/SJTU-IPADS/PowerInfer
https://hackaday.com/2023/03/22/why-llama-is-a-big-deal/
https://www.reddit.com/r/hetzner/comments/17oyvuo/gpu_servers_to_host_llms/
https://www.reddit.com/r/LocalLLaMA/comments/17pv1aw/openais_devday_made_me_determined_to_work_more/
LM Studio
Oobabooga
https://github.com/oobabooga/text-generation-webui#installation
Excellent pointers to:
- different model backends
LLaMA
https://www.reddit.com/r/LocalLLaMA/comments/1227uj5/my_experience_with_alpacacpp
Advanced
Hyperparameter tuning
https://vatsadev.github.io/articles/Layers.html
Cost-aware hyperparameter tuning
https://imbue.com/research/70b-carbs/
“Infinite” context
https://github.com/dingo-actual/infini-transformer
OPEN QUESTIONS
What are model backends?
Model backends are the inference engines that actually load and run the weights, e.g.:
- transformers
- llama.cpp
- ExLlama
- ExLlamaV2
- AutoGPTQ
- GPTQ-for-LLaMa
- CTransformers
- AutoAWQ