What are LLMs?

Architecture

https://alphasignalai.beehiiv.com/p/understanding-llms-0-1

https://www.youtube.com/watch?v=zjkBMFhNj_g

https://github.com/mlabonne/llm-course

GPT - Generative Pre-trained Transformers

https://github.com/karpathy/minGPT

https://pytorch.org/blog/accelerating-generative-ai-2/

https://github.com/pytorch-labs/gpt-fast

Transformer architecture

https://www.3blue1brown.com/lessons/attention

https://jalammar.github.io/illustrated-transformer/

https://jalammar.github.io/illustrated-gpt2/

https://towardsdatascience.com/drawing-the-transformer-network-from-scratch-part-1-9269ed9a2c5e
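
The core operation these posts build up to is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. A minimal single-head, causal sketch in NumPy (shapes and random weights are purely illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)            # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])             # (seq_len, seq_len)
    mask = np.triu(np.full_like(scores, -1e9), k=1)     # hide future tokens (GPT-style)
    return softmax(scores + mask) @ v                   # (seq_len, d_k)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                             # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(causal_self_attention(x, Wq, Wk, Wv).shape)       # (4, 8)
```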

Formation of a world model

https://thegradient.pub/othello/

Hands-on - LLM from scratch

Llama 3

https://github.com/naklecha/llama3-from-scratch

Llama 2

https://magazine.sebastianraschka.com/p/building-llms-from-the-ground-up

Generic LLM

https://github.com/rasbt/LLMs-from-scratch

https://github.com/joennlae/tensorli

https://vgel.me/posts/handmade-transformer/

https://vgel.me/posts/faster-inference/

https://github.com/karpathy/minGPT

https://github.com/tinygrad/tinygrad

https://docs.tinygrad.org/developer/

With plain C + CUDA

https://github.com/karpathy/llm.c

Mixture of Experts from scratch

https://github.com/AviSoori1x/makeMoE
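
The routing idea behind makeMoE, reduced to a toy: a learned gate scores the experts per token, only the top-k run, and their outputs are mixed with the renormalized gate weights. The code below is an illustrative NumPy sketch, not makeMoE's actual (PyTorch) implementation:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, gate_W, experts, top_k=2):
    # x: (n_tokens, d); gate_W: (d, n_experts); experts: list of (W, b) toy MLPs
    logits = x @ gate_W                                  # router scores per token
    out = np.zeros_like(x)
    for t, tok in enumerate(x):
        probs = softmax(logits[t])
        chosen = np.argsort(probs)[-top_k:]              # indices of the top-k experts
        weights = probs[chosen] / probs[chosen].sum()    # renormalize their gate weights
        for w, e_idx in zip(weights, chosen):
            W, b = experts[e_idx]
            out[t] += w * np.tanh(tok @ W + b)           # tiny stand-in "expert"
    return out

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [(rng.normal(size=(d, d)), np.zeros(d)) for _ in range(n_experts)]
y = moe_forward(rng.normal(size=(5, d)), rng.normal(size=(d, n_experts)), experts)
print(y.shape)  # (5, 8)
```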

Build with LLMs

https://eugeneyan.com/writing/llm-patterns

  • Evals: To measure performance (a minimal harness is sketched after this list)
  • RAG: To add recent, external knowledge
  • Fine-tuning: To get better at specific tasks
  • Caching: To reduce latency & cost
  • Guardrails: To ensure output quality
  • Defensive UX: To anticipate & manage errors gracefully
  • Collect user feedback: To build our data flywheel
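
A minimal harness for the Evals pattern: exact-match accuracy over a small set of prompt/answer pairs. `llm` is a hypothetical stand-in for whatever model call you use:

```python
def evaluate(llm, cases):
    # cases: list of (prompt, expected_answer) pairs
    hits = 0
    for prompt, expected in cases:
        answer = llm(prompt).strip().lower()
        hits += int(answer == expected.strip().lower())
    return hits / len(cases)

cases = [
    ("What is the capital of France? Answer in one word.", "Paris"),
    ("2 + 2 = ? Reply with the number only.", "4"),
]
# evaluate(my_llm, cases) -> 1.0 if both answers match exactly
```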

Retrieval-Augmented Generation (RAG)

https://www.businessinsider.com/retrieval-augmented-generation-making-ai-language-models-better-2024-5

https://jxnl.co/writing/2024/05/22/systematically-improving-your-rag/#start-with-synthetic-data
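
The core RAG loop is short: embed the documents, retrieve the ones closest to the query, and stuff them into the prompt. A bare-bones sketch, where `embed` and `generate` are hypothetical stand-ins for your embedding model and LLM:

```python
import numpy as np

def retrieve(query, docs, embed, top_k=3):
    doc_vecs = np.array([embed(d) for d in docs])
    q = np.array(embed(query))
    # cosine similarity between the query and every document
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:top_k]]

def rag_answer(query, docs, embed, generate):
    context = "\n\n".join(retrieve(query, docs, embed))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```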

RAG setups

https://github.com/pingcap/autoflow

  • Graph-based knowledge base RAG
  • TiDB Vector
  • LlamaIndex

Structured output (JSON schema)

https://openai.com/index/introducing-structured-outputs-in-the-api/
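
A sketch of the schema-constrained output described in the announcement above, using the OpenAI Python SDK; the model name and schema are illustrative, so check the current API docs for the exact parameters:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["name", "year"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # illustrative; any structured-outputs-capable model
    messages=[{"role": "user", "content": "When was the transformer paper published?"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "paper_info", "strict": True, "schema": schema},
    },
)
print(json.loads(resp.choices[0].message.content))  # guaranteed to match the schema
```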

Model performance

https://artificialanalysis.ai/

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

LLM application stack

https://github.com/a16z-infra/llm-app-stack

Streaming UI kits

https://sdk.vercel.ai/docs

https://sdk.vercel.ai/docs/guides/providers/openai

Running LLMs locally

https://abishekmuthian.com/how-i-run-llms-locally/

  • Ollama
  • Continue
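
Once Ollama is running it exposes a local REST API (port 11434 by default). A minimal, non-streaming call, assuming the model has already been pulled (e.g. via `ollama run llama3`):

```python
import json
import urllib.request

payload = {"model": "llama3", "prompt": "Why run LLMs locally?", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])  # the model's full completion
```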

In the browser

https://github.com/abi/secret-llama

Ollama

https://mobiarch.wordpress.com/2024/02/19/run-rag-locally-using-mistral-ollama-and-langchain/

  • Fine-tuning
  • RAG
  • Mistral

OpenWebUI

https://github.com/open-webui/open-webui

Ollama in the cloud

https://fly.io/blog/gpu-ga/

https://fly.io/blog/scaling-llm-ollama/

Tools

LlamaIndex

LangSmith

Instructor

Lessons learned building products on LLMs

https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/

Applications / recipes

Intro

https://microsoft.github.io/generative-ai-for-beginners/#/

https://cookbook.openai.com/

https://applied-llms.org

Rewriting codebases

https://blog.withmantle.com/code-conversion-using-ai/

  • Using Gemini with a 1M token context window

phidata - framework for AI assistants

https://github.com/phidatahq/phidata

Investment research

https://github.com/phidatahq/phidata/tree/main/cookbook/llms/groq/investment_researcher

Completely self-hosted “Chat with your Docs” RAG application

https://lightning.ai/lightning-ai/studios/compare-llama-3-and-phi-3-using-rag

  • LlamaIndex
  • Ollama
  • Llama-3 and Phi-3 models
  • Streamlit UI

Llama 3 with a 1M token context window

https://ollama.com/library/llama3-gradient

Voice chat

https://modal.com/docs/examples/llm-voice-chat

AI therapist

https://eugeneyan.com/writing/ai-coach/

  • Voice API (vapi.com)

Building GPTs

What is GPT Builder, intro to creating GPTs: https://twitter.com/rowancheung/status/1721594409274478847?s=20 https://twitter.com/rowancheung/status/1722971638239514869?s=20

Add Actions: https://twitter.com/rowancheung/status/1724436285983469857?s=20

Best practices: https://twitter.com/rowancheung/status/1724783579073572924?s=20

Examples of the best GPTs (as of Nov 2023): https://twitter.com/rowancheung/status/1723379655103885728?s=20

Top 10 favorites (as of Nov 2023): https://twitter.com/rowancheung/status/1723711759242895417?s=20 https://supertools.therundown.ai/

Prompt engineering vs fine-tuning vs RAG

https://myscale.com/blog/prompt-engineering-vs-finetuning-vs-rag/

Prompt engineering

https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api

  • specifically for OpenAI LLMs

https://thenameless.net/astral-kit/anthropic-peit-04

  • specifically for Anthropic Claude

Advanced prompts

https://www.microsoft.com/en-us/research/blog/steering-at-the-frontier-extending-the-power-of-prompting/

Accuracy for long context

https://www.anthropic.com/index/claude-2-1-prompting

Transfer-learning and fine-tuning

https://dev.to/luxacademy/understanding-the-differences-fine-tuning-vs-transfer-learning-370

Validation data set

https://mlops.systems/posts/2024-06-25-evaluation-finetuning-manual-dataset.html

Limitations of fine-tuning

https://www.latent.space/p/fastai

  • catastrophic forgetting
  • alternative: continued pre-training

RAG - Retrieval Augmented Generation

https://news.ycombinator.com/item?id=38759877

https://platform.openai.com/docs/assistants/tools/knowledge-retrieval

LLMs on edge devices

https://github.com/facebookresearch/MobileLLM

  • quantization
  • based on HuggingFace transformers

https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization

  • data types
  • post-training quantization (PTQ, sketched below)
  • quantization-aware training (QAT)
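
Post-training quantization in its simplest form: one symmetric per-tensor scale mapping float32 weights to int8, with dequantization at use time. Real PTQ/QAT pipelines (as in the guide above) are per-channel and calibration-driven; this only shows the arithmetic:

```python
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                      # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())                # worst-case quantization error
```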

Testing and performance evaluation

https://arxiv.org/abs/2405.14782

Explainability

https://openai.com/index/extracting-concepts-from-gpt-4/

Safety and InfoSec

Extraction of training data

https://arxiv.org/abs/2311.17035

Using an LLM to jailbreak another LLM

https://www.scientificamerican.com/article/jailbroken-ai-chatbots-can-jailbreak-other-chatbots/

Optimization

https://victoria.dev/blog/how-to-send-long-text-input-to-chatgpt-using-the-openai-api/

https://arxiv.org/abs/2405.05417

  • detection of untrained and lightly-trained tokens

Fine-tuning of open source models

https://arxiv.org/pdf/2405.00732

  • specialization
  • performance comparable to GPTs

Open models

01.ai Yi

https://github.com/01-ai/Yi-1.5

Meta Llama 3

https://github.com/meta-llama/llama3

Mistral 7B

https://www.secondstate.io/articles/mistral-7b-instruct-v0.1/

Fine-tune Mistral models

https://github.com/mistralai/mistral-finetune

  • LoRA (sketched below)
  • single-node, multi-GPU
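
The LoRA idea in miniature: freeze the base weight and learn a low-rank update B·A scaled by alpha/r. A PyTorch sketch with illustrative hyperparameters, not mistral-finetune's actual code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                              # freeze the base weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init -> starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        # base output plus the low-rank update; only A and B receive gradients
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```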

Meta Llama 2

Nvidia Nemotron

https://bit.ly/49PlMhW

Run Locally

https://github.com/SJTU-IPADS/PowerInfer

https://hackaday.com/2023/03/22/why-llama-is-a-big-deal/

https://www.reddit.com/r/hetzner/comments/17oyvuo/gpu_servers_to_host_llms/

https://www.reddit.com/r/LocalLLaMA/comments/17pv1aw/openais_devday_made_me_determined_to_work_more/

LM Studio

https://lmstudio.ai/

Oobabooga

https://github.com/oobabooga/text-generation-webui#installation

Excellent pointers to:

  • different model backends

LLaMA

https://www.reddit.com/r/LocalLLaMA/comments/128tp9n/having_a_20_gig_file_that_you_can_ask_an_offline/

https://www.reddit.com/r/LocalLLaMA/comments/1227uj5/my_experience_with_alpacacpp

Advanced

Hyperparameter tuning

https://vatsadev.github.io/articles/Layers.html

Cost-aware hyperparameter tuning

https://imbue.com/research/70b-carbs/

“Infinite” context

https://github.com/dingo-actual/infini-transformer

OPEN QUESTIONS

What are model backends?

Model backends like:

  • transformers
  • llama.cpp
  • ExLlama
  • ExLlamaV2
  • AutoGPTQ
  • GPTQ-for-LLaMa
  • CTransformers
  • AutoAWQ