What are LLMs?

Architecture

https://youtu.be/7xTGNNLPyMI?si=vnpna0PUnb0eVkPe

https://alphasignalai.beehiiv.com/p/understanding-llms-0-1

https://www.youtube.com/watch?v=zjkBMFhNj_g

https://github.com/mlabonne/llm-course

GPT - Generative Pre-trained Transformer

https://github.com/karpathy/minGPT

https://pytorch.org/blog/accelerating-generative-ai-2/ https://github.com/pytorch-labs/gpt-fast

Transformer architecture

https://www.3blue1brown.com/lessons/attention

https://jalammar.github.io/illustrated-transformer/

https://jalammar.github.io/illustrated-gpt2/

https://towardsdatascience.com/drawing-the-transformer-network-from-scratch-part-1-9269ed9a2c5e
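
A minimal NumPy sketch of the scaled dot-product attention the links above explain: softmax(Q Kᵀ / sqrt(d)) V, shown here as toy self-attention with Q = K = V.

    import numpy as np

    def attention(Q, K, V):
        # scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # similarity of each query to each key
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        return weights @ V                               # weighted mix of the values

    # toy check: 3 tokens, 4-dim embeddings
    x = np.random.randn(3, 4)
    print(attention(x, x, x).shape)  # (3, 4)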

Internals

https://www.anthropic.com/research/tracing-thoughts-language-model

https://transformer-circuits.pub/2025/attribution-graphs/methods.html

https://transformer-circuits.pub/2025/attribution-graphs/biology.html

Reasoning LLMs

https://magazine.sebastianraschka.com/p/understanding-reasoning-llms

Formation of a world model

https://thegradient.pub/othello/

Hands-on - Programmatic use of LLMs

https://building-with-llms-pycon-2025.readthedocs.io/en/latest/

Hands-on - LLM from scratch

Llama 3

https://github.com/naklecha/llama3-from-scratch

Llama 2

https://magazine.sebastianraschka.com/p/building-llms-from-the-ground-up

GPT-2 with WebGL

https://nathan.rs/posts/gpu-shader-programming/

Small GPT with PyTorch

https://github.com/Om-Alve/smolGPT

Generic LLM

https://github.com/rasbt/LLMs-from-scratch

https://github.com/joennlae/tensorli

https://vgel.me/posts/handmade-transformer/

https://vgel.me/posts/faster-inference/

https://github.com/karpathy/minGPT

https://github.com/tinygrad/tinygrad

https://docs.tinygrad.org/developer/

With plain C + CUDA

https://github.com/karpathy/llm.c

Mixture of Experts from scratch

https://github.com/AviSoori1x/makeMoE
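
The core idea in a few lines (a rough sketch, not makeMoE's actual code): a learned router picks the top-k experts per token and mixes their outputs.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        """Sparse MoE: a gate routes each token to its top-k experts."""
        def __init__(self, dim=64, n_experts=4, k=2):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
                for _ in range(n_experts)
            )
            self.gate = nn.Linear(dim, n_experts)
            self.k = k

        def forward(self, x):                      # x: (tokens, dim)
            weights, idx = self.gate(x).topk(self.k, dim=-1)
            weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e       # tokens routed to expert e in this slot
                    if mask.any():
                        out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out

    print(MoELayer()(torch.randn(10, 64)).shape)   # torch.Size([10, 64])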

Build with LLMs

https://eugeneyan.com/writing/llm-patterns

  • Evals: To measure performance
  • RAG: To add recent, external knowledge
  • Fine-tuning: To get better at specific tasks
  • Caching: To reduce latency & cost (see the sketch after this list)
  • Guardrails: To ensure output quality
  • Defensive UX: To anticipate & manage errors gracefully
  • Collect user feedback: To build our data flywheel
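
A minimal sketch of the caching pattern from that list; call_llm is a hypothetical stand-in for a real API call.

    import hashlib, json

    _cache = {}  # in production: Redis, a database, or an embedding-based semantic cache

    def call_llm(model, prompt):
        # hypothetical stand-in for a real LLM API call
        return f"response from {model} to: {prompt}"

    def cached_llm_call(model, prompt):
        # exact-match cache: identical (model, prompt) pairs never hit the API twice
        key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
        if key not in _cache:
            _cache[key] = call_llm(model, prompt)
        return _cache[key]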

Retrieval-Augmented Generation (RAG)

https://www.businessinsider.com/retrieval-augmented-generation-making-ai-language-models-better-2024-5

https://jxnl.co/writing/2024/05/22/systematically-improving-your-rag/#start-with-synthetic-data
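
The basic RAG loop: embed the documents, retrieve the most similar ones for a query, and stuff them into the prompt. A toy, self-contained sketch that uses hashing-trick bag-of-words vectors in place of a real embedding model:

    import numpy as np

    def embed(text, dim=256):
        # toy embedding; a real system would call an embedding model
        v = np.zeros(dim)
        for word in text.lower().split():
            v[hash(word) % dim] += 1.0           # hash() is salted per process; fine within one run
        return v / (np.linalg.norm(v) + 1e-9)

    docs = ["Ollama runs models locally.", "LoRA is a fine-tuning method.", "TiDB is a database."]
    doc_vecs = np.stack([embed(d) for d in docs])

    def retrieve(query, k=2):
        scores = doc_vecs @ embed(query)         # cosine similarity (vectors are unit-norm)
        return [docs[i] for i in np.argsort(-scores)[:k]]

    question = "How do I fine-tune with LoRA?"
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"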

RAG setups

https://github.com/pingcap/autoflow

  • Graph-based knowledge base RAG
  • TiDB Vector
  • LlamaIndex

Rlama - RAG + local models

https://rlama.dev/

DeepRAG

https://arxiv.org/abs/2502.01142

Structured output (JSON schema)

https://openai.com/index/introducing-structured-outputs-in-the-api/

https://github.com/anthropics/anthropic-cookbook/blob/main/tool_use/extracting_structured_json.ipynb
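
A minimal sketch of the OpenAI structured-outputs call described in the announcement above; the model name and schema are placeholders.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # any model with structured-output support
        messages=[{"role": "user", "content": "Extract: Alice, 30, engineer"}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "person",
                "strict": True,  # constrain decoding to the schema
                "schema": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "age": {"type": "integer"},
                        "job": {"type": "string"},
                    },
                    "required": ["name", "age", "job"],
                    "additionalProperties": False,
                },
            },
        },
    )
    print(response.choices[0].message.content)  # JSON guaranteed to match the schema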

Model performance

https://artificialanalysis.ai/

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

LLM application stack

https://github.com/a16z-infra/llm-app-stack

Streaming UI kits

https://sdk.vercel.ai/docs https://sdk.vercel.ai/docs/guides/providers/openai

Running LLMs locally

https://abishekmuthian.com/how-i-run-llms-locally/

  • Ollama
  • Continue

In the browser

https://github.com/abi/secret-llama

Ollama

https://mobiarch.wordpress.com/2024/02/19/run-rag-locally-using-mistral-ollama-and-langchain/

  • Fine-tuning
  • RAG
  • Mistral
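
A minimal sketch of calling Ollama's local HTTP API (assumes `ollama serve` is running and the mistral model has been pulled):

    import requests

    # Ollama exposes a local HTTP API on port 11434 once `ollama serve` is running
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": "Why is the sky blue?", "stream": False},
    )
    print(resp.json()["response"])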

OpenWebUI

https://github.com/open-webui/open-webui

Ollama in the cloud

https://fly.io/blog/gpu-ga/

https://fly.io/blog/scaling-llm-ollama/

Tools

LlamaIndex

LangSmith

Instructor

Lessons learned building products on LLMs

https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/

Applications / recipes

AI-native apps instead of traditional apps with AI mixed in

https://koomen.dev/essays/horseless-carriages/

Intro

https://microsoft.github.io/generative-ai-for-beginners/#/

https://cookbook.openai.com/

https://applied-llms.org

Rewriting codebases

https://blog.withmantle.com/code-conversion-using-ai/

  • Using Gemini with a 1M-token context window

phidata - framework for AI assistants

https://github.com/phidatahq/phidata

Investment research

https://github.com/phidatahq/phidata/tree/main/cookbook/llms/groq/investment_researcher

Completely self-hosted “Chat with your Docs” RAG application

https://lightning.ai/lightning-ai/studios/compare-llama-3-and-phi-3-using-rag

  • LlamaIndex
  • Ollama
  • Llama-3 and Phi-3 models
  • Streamlit UI

Llama 3 with a 1M-token context window

https://ollama.com/library/llama3-gradient

Voice chat

https://modal.com/docs/examples/llm-voice-chat

AI therapist

https://eugeneyan.com/writing/ai-coach/

  • Voice API (vapi.com)

Building GPTs

What is GPT Builder, intro to creating GPTs: https://twitter.com/rowancheung/status/1721594409274478847?s=20 https://twitter.com/rowancheung/status/1722971638239514869?s=20

Add Actions: https://twitter.com/rowancheung/status/1724436285983469857?s=20

Best practices: https://twitter.com/rowancheung/status/1724783579073572924?s=20

Examples of the best GPTs (as of Nov 2023): https://twitter.com/rowancheung/status/1723379655103885728?s=20

Top 10 favorites (as of Nov 2023): https://twitter.com/rowancheung/status/1723711759242895417?s=20 https://supertools.therundown.ai/

Prompt engineering vs fine-tuning vs RAG

https://myscale.com/blog/prompt-engineering-vs-finetuning-vs-rag/

Prompt engineering

For OpenAI models

https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api

For Anthropic models

https://thenameless.net/astral-kit/anthropic-peit-04

Example prompt 1

Be terse. Do not offer unprompted advice or clarifications.

Avoid mentioning you are an AI language model.

Avoid disclaimers about your knowledge cutoff.

Avoid disclaimers about not being a professional or an expert.

Do NOT hedge or qualify. Do not waffle.

Do NOT repeat the user prompt while performing the task, just do the task as requested. NEVER contextualise the answer. This is very important.

Avoid suggesting seeking professional help.

Avoid mentioning safety unless it is not obvious and very important.

Remain neutral on all topics. Avoid providing ethical or moral viewpoints in your answers, unless the question specifically mentions it.

Never apologize.

Act as an expert in the relevant fields.

Speak in specific, topic relevant terminology.

Explain your reasoning. If you don’t know, say you don’t know.

Cite sources whenever possible, and include URLs if possible.

List URLs at the end of your response, not inline.

Speak directly and be willing to make creative guesses.

Be willing to reference less reputable sources for ideas.

Ask for more details before answering unclear or ambiguous questions.

Example prompt 2

“Respond directly to prompts without self-judgment or excessive qualification. Do not use phrases like 'I aim to be', 'I should note', or 'I want to emphasize'.

Skip meta-commentary about your own performance. Maintain intellectual rigor but try to avoid caveats. When uncertainty exists, state it once and move on.

Treat our exchange as a conversation between peers. Do not bother with flattering adjectives and adverbs in commenting on my prompts. No “nuanced”, “insightful” etc. But feel free to make jokes and even poke fun at me and my spelling errors.

Always suggest good texts with full references and even PubMed IDs.

Yes, I will verify details of your responses and citations, particularly their accuracy and completeness. That is not your job. It is mine to check and read.

Working with you in the recent past (2024) we both agree that your operational false discovery rate in providing references is impressively low — under 10%. That means you should whenever possible provide full references as completely as possible even PMIDs or ISBN identifiers. I WILL check.

Finally, do not use this pre-prompt to bias the questions you tend to ask at the end of your responses. Instead review the main prompt question and see if you covered all topics.

End of “pre-prompt”.

Advanced prompts

https://www.microsoft.com/en-us/research/blog/steering-at-the-frontier-extending-the-power-of-prompting/

Accuracy for long context

https://www.anthropic.com/index/claude-2-1-prompting

Transfer learning and fine-tuning

https://dev.to/luxacademy/understanding-the-differences-fine-tuning-vs-transfer-learning-370

Validation data set

https://mlops.systems/posts/2024-06-25-evaluation-finetuning-manual-dataset.html

Limitations of fine-tuning

https://www.latent.space/p/fastai

  • catastrophic forgetting
  • alternative: continued pre-training

RAG - Retrieval-Augmented Generation

https://news.ycombinator.com/item?id=38759877

https://platform.openai.com/docs/assistants/tools/knowledge-retrieval

LLMs on edge devices

https://github.com/facebookresearch/MobileLLM

  • quantization
  • based on HuggingFace transformers

https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization

  • data types
  • post-training quantization (PTQ) (see the sketch after this list)
  • quantization-aware training (QAT)
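
A toy sketch of symmetric int8 post-training quantization: round each weight to an 8-bit integer plus one float scale per tensor.

    import numpy as np

    def quantize_int8(w):
        # symmetric PTQ: map [-max|w|, max|w|] onto [-127, 127]
        scale = np.abs(w).max() / 127.0
        q = np.round(w / scale).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, scale = quantize_int8(w)
    print(np.abs(w - dequantize(q, scale)).max())  # worst-case quantization error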

Testing and performance evaluation

https://arxiv.org/abs/2405.14782

Explainability

https://openai.com/index/extracting-concepts-from-gpt-4/

Safety and InfoSec

Extraction of training data

https://arxiv.org/abs/2311.17035

Using an LLM to jailbreak another LLM

https://www.scientificamerican.com/article/jailbroken-ai-chatbots-can-jailbreak-other-chatbots/

Optimization

https://victoria.dev/blog/how-to-send-long-text-input-to-chatgpt-using-the-openai-api/

https://arxiv.org/abs/2405.05417

  • detection of untrained and lightly-trained tokens

Fine-tuning of open source models

https://arxiv.org/pdf/2405.00732

  • specialization
  • performance comparable to GPTs

Open models

2025

01.ai Yi

https://github.com/01-ai/Yi-1.5

Meta Llama 3 (2024)

https://github.com/meta-llama/llama3

Mistral 7B

https://www.secondstate.io/articles/mistral-7b-instruct-v0.1/

Fine-tune Mistral models

https://github.com/mistralai/mistral-finetune

  • LoRA (see the sketch after this list)
  • single-node, multi-GPU
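
The LoRA idea in a few lines (a sketch of the technique, not mistral-finetune's actual code): freeze the base weights and train only a low-rank update, W x + (alpha/r) B A x.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Wrap a frozen linear layer with a trainable low-rank update."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                  # only the adapter is trained
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B=0: no-op at init
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    layer = LoRALinear(nn.Linear(512, 512))
    print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # adapter params only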

Meta Llama (2023)

Nvidia Nemotron

https://bit.ly/49PlMhW

Run Locally

https://github.com/SJTU-IPADS/PowerInfer

https://hackaday.com/2023/03/22/why-llama-is-a-big-deal/

https://www.reddit.com/r/hetzner/comments/17oyvuo/gpu_servers_to_host_llms/

https://www.reddit.com/r/LocalLLaMA/comments/17pv1aw/openais_devday_made_me_determined_to_work_more/

LM Studio

https://lmstudio.ai/

Oobabooga

https://github.com/oobabooga/text-generation-webui#installation

Excellent pointers to:

  • different model backends

LLaMA

https://www.reddit.com/r/LocalLLaMA/comments/128tp9n/having_a_20_gig_file_that_you_can_ask_an_offline/

https://www.reddit.com/r/LocalLLaMA/comments/1227uj5/my_experience_with_alpacacpp

Advanced

Hyperparameter tuning

https://vatsadev.github.io/articles/Layers.html

Cost-aware hyperparameter tuning

https://imbue.com/research/70b-carbs/

“Infinite” context

https://github.com/dingo-actual/infini-transformer

OPEN QUESTIONS

What are model backends?

Model backends like:

  • transformers
  • llama.cpp
  • ExLlama
  • ExLlamaV2
  • AutoGPTQ
  • GPTQ-for-LLaMa
  • CTransformers
  • AutoAWQ