knowhere

Preparing for Coding Interviews (A journey) - The Problem Solving Process

2024-07-29T00:00:00+00:00

Documenting the steps I follow when I think through solutions for a programming problem

Preparing for Coding Interviews (A journey) - Noteworthy Data Structures and Algorithms

2024-07-26T00:00:00+00:00

Documenting a list of notable data structures and algorithms that I can work towards mastering

Preparing for Coding Interviews (A journey) - Programming Language Selection

2024-07-24T00:00:00+00:00

Documenting the factors that go into the selection of a programming language to write solutions in

Preparing for Coding Interviews (A journey)

2024-07-23T00:00:00+00:00

This series of posts documents the learnings and challenges I encountered, and the decisions I made, on my journey to hone the (algorithmic) tools of my trade. It’s partly a tool for me to arrange my thoughts, partly a resource for me to revisit every now and then, and partly a guide for anyone who might find it useful.

For any Hiring Managers/TA Partners seeing this, drop me an email if you like what you see!

Posts

Vision Language Models

2024-04-22T00:00:00+00:00

Key Ideas

Images can be represented as a collection of visual “words” or patches, allowing the attention mechanism to be applied to them
Architectures for Vision and Text models are converging, allowing for native multimodality

Notes

Vision Basics

Representation

Grayscale images are matrices, Color images are tensors
Image pixel values exist in a fixed range, called a Color Space

Convolution Networks

Given a convolution mask k, we can create a representation of an image
- g(x, y) = Sum_{v} Sum_{u} k(u, v)f(x - u, y - v) where f is the input representation, g is the new representation
- Can do things such as: “Sharpen”, “Find Edges”, “Blur”, etc.
Stack enough depth, and the network will learn more complex features (image -> edges -> groups of edges -> collections of interesting features)
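
A minimal sketch of the convolution formula above in pure Python. The zero padding outside the image and the 3x3 example masks are my own assumptions for illustration; they are not from the lecture.

```python
def convolve2d(f, k):
    """Apply g(x, y) = sum_v sum_u k(u, v) * f(x - u, y - v) with zero padding.

    f is a list of rows (f[y][x]); k is a square mask with odd side length,
    stored so that its centre entry corresponds to the offset (u, v) = (0, 0).
    """
    height, width = len(f), len(f[0])
    r = len(k) // 2  # mask radius
    g = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for v in range(-r, r + 1):
                for u in range(-r, r + 1):
                    yy, xx = y - v, x - u
                    # Pixels outside the image are treated as zero.
                    if 0 <= yy < height and 0 <= xx < width:
                        total += k[v + r][u + r] * f[yy][xx]
            g[y][x] = total
    return g


identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # leaves the image unchanged
box_blur = [[1 / 9] * 3 for _ in range(3)]    # averages each 3x3 neighbourhood
image = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]

same = convolve2d(image, identity)
blurred = convolve2d(image, box_blur)
```

With the blur mask, the single bright pixel is spread evenly over its neighbours, which is the "Blur" effect mentioned above. A convolution network learns the mask values instead of fixing them by hand.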

Transformer Networks

Apply self attention on pixel values
- Problems around extracting 2D relation information from the image and local vs global attention
Use patches instead of pixels (dModel = 768!)
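
One way to read the dModel = 768 figure: a 16x16 patch with 3 colour channels flattens to 16 * 16 * 3 = 768 values. The sketch below assumes a 224x224 RGB input, which is my assumption rather than something stated in these notes.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append the C channel values
            patches.append(vec)
    return patches


# Assumed sizes: a 224 x 224 RGB image cut into 16 x 16 patches.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, 16)
num_patches, patch_dim = len(patches), len(patches[0])
```

Under these assumptions the sequence has 196 patch "words" rather than 224 * 224 = 50176 pixel tokens, which keeps self attention tractable.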

Multimodality

Contrastive Language–Image Pre-training (CLIP)

Step 1: Train a model to maximise the similarity scores between image encoding and corresponding text encoding
Step 2: Given an input in a mode, create encodings for all potential counterparts in the other mode
Step 3: Find similarity scores between encoding of input with counterparts and select the highest one
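
Steps 2 and 3 can be sketched as a nearest-neighbour lookup by cosine similarity. The encodings below are hand-written stand-ins, not outputs of a real CLIP encoder, and `best_match` is a hypothetical helper name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query, candidates):
    """Return the label whose encoding is most similar to the query encoding."""
    return max(candidates, key=lambda label: cosine(query, candidates[label]))


# Hypothetical encodings standing in for the image and text encoder outputs.
image_encoding = [0.9, 0.1, 0.2]
text_encodings = {
    "a photo of a dog": [1.0, 0.0, 0.1],
    "a photo of a cat": [0.0, 1.0, 0.1],
    "a photo of a car": [0.1, 0.1, 1.0],
}
label = best_match(image_encoding, text_encodings)
```

The same lookup works in the other direction (text query, image candidates), since both modes share one embedding space.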

Fuyu

Step 1: Create image patches, encode it into linear vectors to become “words”
- Special mention: Image “newline” character!
Step 2: Append the image patch vector sequence with text words vectors and feed into the Transformer architecture
Step 3: Only perform prediction on the output embeddings corresponding to the text
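
A rough sketch of how the input sequence in Steps 1 and 2 might be laid out. Strings stand in for the patch and word vectors, and the marker name `<image-newline>` is a placeholder of mine, not Fuyu's actual token.

```python
IMAGE_NEWLINE = "<image-newline>"  # hypothetical marker name


def build_sequence(patch_grid, text_tokens):
    """Flatten a grid of patch 'words' row by row, then append the text tokens."""
    sequence = []
    for row in patch_grid:
        sequence.extend(row)
        sequence.append(IMAGE_NEWLINE)  # marks the end of one row of patches
    return sequence + list(text_tokens)


patch_grid = [["p00", "p01"], ["p10", "p11"]]
sequence = build_sequence(patch_grid, ["what", "is", "this", "?"])
```

The newline marker is what lets the model recover the 2D layout from a 1D sequence, whatever the image width.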

Aside: V* Vision Search

Some features might be too insignificant in the larger context to be accurately pinpointed by traditional methods. The V* method might be helpful

Step 1: Use LLM to identify patches which may contain subject
Step 2: Use high likelihood patches as inputs for actual query

Resources

Dr. Mohit Iyyer’s UMass CS685 S24 Lecture 19
An image is worth 16x16 words
OpenAI’s CLIP
AdeptAI’s Fuyu

Scaling Laws for Large Language Models

2024-04-10T00:00:00+00:00

Key Ideas

There seems to be an upper cap on LLM performance if compute budget is kept fixed, i.e. capacity of a model
Increasing either data or model parameters alone is not enough to improve performance
Various studies have found quantified relationships between data size, model size and compute budget which can be used to inform how much resource to use when training an LLM.

Notes

Training Tradeoffs

Factors informing training are

Dataset Size (D # of training tokens)
Model Size (N # of model parameters)
Compute Budget (C = FLOPs(N, D))

To solve: argmin_{N, D} L(N, D) s.t. FLOPs(N, D) = C where L(N, D) = A/(N^alpha) + B/(D^beta) + E
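
A small numerical sketch of this constrained minimisation. Two assumptions are mine and not in the notes: the common approximation FLOPs(N, D) = 6 * N * D, and the constants, which are the fitted values I recall from Hoffmann et al. (2022) and should be treated as illustrative.

```python
# Assumed constants (recalled from Hoffmann et al., 2022); illustrative only.
A, B, E = 406.4, 410.7, 1.69
ALPHA, BETA = 0.34, 0.28


def loss(n, d):
    """L(N, D) = A / N^alpha + B / D^beta + E."""
    return A / n**ALPHA + B / d**BETA + E


def best_split(compute, steps=2000):
    """Grid-search N on a log scale, with D fixed by the assumed C = 6 * N * D."""
    best = None
    for i in range(1, steps):
        n = 10 ** (6 + 6 * i / steps)  # N between 1e6 and 1e12 parameters
        d = compute / (6 * n)          # tokens affordable at this model size
        candidate = (loss(n, d), n, d)
        if best is None or candidate < best:
            best = candidate
    return best


best_loss, best_n, best_d = best_split(1e21)
```

For a fixed budget, making the model much larger starves it of data and making it much smaller wastes data, so the loss has a minimum somewhere in between.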

Kaplan Scaling Laws

Findings from Kaplan et al., 2020

Performance depends strongly on scale and weakly on model shape
Increasing both dataset and model size is key to improved performance. Increasing one while keeping the other fixed leads to diminishing returns (assuming uncapped compute)
Larger models are more sample efficient
Prioritize increasing model size over data size

Issue

Same learning rate schedule was used for all training runs, regardless of batch size

Chinchilla Scaling Laws

Findings from Hoffmann et al., 2022

Fixed the learning rate schedules
Diff from Kaplan: Increase the data and model size by the same factor
For a fixed compute budget, they found two power-law relationships (straight lines on a log-log plot) - one for model size and one for dataset size

Resources

Dr. Mohit Iyyer’s UMass CS685 S24 Lecture 17

Position Embeddings and Efficient Attention

2024-04-08T00:00:00+00:00

Key Ideas

Position Encodings give the model a notion of order

Notes

Embedding Format

Type 1: Absolute Positions q_1 = w_q . (c_1 + p_1)

Fixed Format

Allows for arbitrary length input sequences, esp. at test time
Practically, the model does not effectively learn this

Learned

Lets the model figure out the best format to encode this information
Cannot be used for longer length sequences at test time than that set at train time

Type 2: Relative Positions

Represent every pair of tokens, and measure the relative positions difference between them
Could be better suited as input sequences might have variations, and text might be prepended/truncated - changing the absolute positions while keeping the relative position difference the same
Cannot be directly added to input embedding (exception: RoPE). Instead, directly modifies the attention matrix

ALiBi

Decay the q.k dot products in the attention calculation by the difference in positions
- Mask = [[0, -inf, -inf, -inf], [-1, 0, -inf, -inf], [-2, -1, 0, -inf], [-3, -2, -1, 0]] * m where m is a ‘magnitude’/’slope’ which is a hyperparameter, and only varies between attention heads
Enables extrapolation beyond training sequence length
Position information does not affect v
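
A minimal sketch that builds the bias matrix above for one attention head. `alibi_bias` is a hypothetical helper name; the result is meant to be added to the q.k scores before the softmax.

```python
NEG_INF = float("-inf")


def alibi_bias(seq_len, slope):
    """Causal ALiBi bias: slope * (j - i) for j <= i, -inf for future positions."""
    return [
        [slope * (j - i) if j <= i else NEG_INF for j in range(seq_len)]
        for i in range(seq_len)
    ]


bias = alibi_bias(4, slope=1.0)
half = alibi_bias(4, slope=0.5)  # a head with a gentler decay
```

Because the penalty depends only on the distance i - j, the same rule applies unchanged to positions longer than anything seen in training.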

Rotary Position Embeddings (RoPE)

Rotate q by an angle proportional to its position, and k by an angle proportional to its position. q.k will then only have information about the relative position difference encoded, not the absolute positions
i.e. f_q(c_4, 4) = q_4, f_k(c_1, 1) = k_1, q_4.k_1 = g(c_4, c_1, 4-1). Find f_q, f_k, g
f_q(c_t, t) = R_{theta, t} . (w_q . c_t) where R_{theta, t} = [[cos(t*theta), -sin(t*theta)], [sin(t*theta), cos(t*theta)]] and theta is a hyperparameter
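
The relative-position property can be checked numerically in two dimensions. This is only a sketch: the vectors stand in for the projected query and key, and the value of theta is arbitrary.

```python
import math


def rotate(vec, position, theta):
    """Rotate a 2D vector by the angle position * theta."""
    angle = position * theta
    x, y = vec
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    )


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


theta = 0.3                      # arbitrary value for illustration
q, k = (1.0, 2.0), (0.5, -1.0)   # stand-ins for the projected query and key

score_near_start = dot(rotate(q, 4, theta), rotate(k, 1, theta))
score_shifted = dot(rotate(q, 104, theta), rotate(k, 101, theta))
score_other_gap = dot(rotate(q, 4, theta), rotate(k, 2, theta))
```

Shifting both positions by 100 leaves the score unchanged (same gap of 3), while changing the gap changes the score.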

Optimized Attention Computation Strategies

Attention calculation has quadratic complexity in the sequence length. The strategies below make it cheaper in practice through careful handling of memory and hardware.

Flash Attention

Rather than storing the results of each intermediate operation back into memory and reading them again for the next, create a new operation which does all these steps in one go, saving on wasteful memory I/O operations
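
As I understand it, the fused operation works because the softmax can be computed block by block with running statistics, so the full score vector never has to be stored. The pure-Python sketch below shows that arithmetic for a single query. It says nothing about real GPU memory behaviour and it omits the usual 1/sqrt(d) scaling.

```python
import math


def attention_full(q, keys, values):
    """Reference: materialise all scores, softmax, then weight the values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def attention_blocked(q, keys, values, block=2):
    """Process keys/values block by block, keeping only running statistics."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # rescale earlier partial results
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0], [0.0, 3.0]]
full = attention_full(q, keys, values)
blocked = attention_blocked(q, keys, values)
```

Both functions give the same output up to floating point error; the blocked version only ever holds one block of scores at a time.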

Ring Attention

Break the attention computation down into chunks, and assign each chunk of the sequence to its own dedicated GPU
Forward results to the next GPU; the GPUs are all arranged in a ring
Eventually, after n forwards (where n is the number of GPUs), every GPU’s memory will have the full attention score for its own subsequence

Resources

Dr. Mohit Iyyer’s UMass CS685 S24 Lecture 16

Evaluating LLM-generated text

2024-04-03T00:00:00+00:00

Key Ideas

Human judgement can be learnt by LLMs to serve as replacements for text quality evaluation tasks
However this runs into problems of generalization and LLM specific bias

Notes

Fixed Scope Task Evaluation - Human Evaluation

Generally indicated by scores based on subjective measures on a 5 point scale

Adequacy: Is the meaning correct?
Fluency: Is it easy to read?

Cons

Subjective!
Difficult to calibrate
Expensive and time consuming

Fixed Scope Task Evaluation - Automatic Metrics

Precision, Recall and F-Scores

Use the precision (common words in y_cap and y_pred / length of y_pred), the recall (common words in y_cap and y_pred / length of y_cap) and the F-Score (precision * recall / ((precision + recall) / 2))
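
A minimal sketch of these three scores on whitespace-split words. I am reading y_pred as the model output and y_cap as the reference, and counting repeated words at most as often as they appear on both sides; both are my assumptions.

```python
from collections import Counter


def overlap_scores(y_pred, y_cap):
    """Word-overlap precision, recall and F-Score for one output/reference pair."""
    pred, ref = y_pred.split(), y_cap.split()
    common = sum((Counter(pred) & Counter(ref)).values())  # shared words
    precision = common / len(pred)
    recall = common / len(ref)
    if common == 0:
        return precision, recall, 0.0
    f_score = precision * recall / ((precision + recall) / 2)
    return precision, recall, f_score


precision, recall, f_score = overlap_scores("the cat sat", "the cat sat on the mat")
```

Here every output word is in the reference (precision 1.0) but only half of the reference is covered (recall 0.5). Shuffling the output words would give the same scores, which is the "does not handle order" problem.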

Pros

Quick, easy and cheap

Cons

Does not handle synonyms
Does not handle order
We may not always have a reference
Does not take into account meaning of the constructed sentence

Bilingual Evaluation Understudy (BLEU)

Based off n-gram overlap between y_pred and y_cap
Computes precision for n-grams of size 1 to 4.
Has a brevity penalty
BLEU = min(1, output-length / ref-length) * (PROD_{i, 1:4} precision_i)^(1/4)
Allows use of multiple references, and we can match against all refs, so recall will not be very useful
- Closest reference length is used
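
A simplified sketch of the formula above for a single reference, with no smoothing. It follows the brevity penalty exactly as written in these notes, which may differ from what standard BLEU implementations use, so treat the numbers as illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of size n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(y_pred, y_cap, max_n=4):
    pred, ref = y_pred.split(), y_cap.split()
    product = 1.0
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        total = sum(pred_counts.values())
        if total == 0:
            return 0.0  # output too short to contain any n-gram of this size
        matched = sum((pred_counts & ref_counts).values())  # clipped matches
        product *= matched / total
    brevity = min(1.0, len(pred) / len(ref))
    return brevity * product ** (1 / max_n)


reference = "the cat sat on the mat"
exact = bleu(reference, reference)
truncated = bleu("the cat sat on", reference)
unrelated = bleu("a dog ran in", reference)
```

The truncated output has perfect n-gram precision, so only the brevity penalty (4 / 6) lowers its score.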

Cons

All words/n-grams are treated equally
Human translations also score lower than machines
The score gives no indication of quality on its own, and cannot be compared across different test strings

ROUGE

Based off n-gram overlap between y_cap and y_pred
Computes recall for n-grams of size 1 to 4
Used for text summarization systems

Cons

All words/n-grams are treated equally
Human translations also score lower than machines
The absolute score is not meaningful on its own
Can game the score by just replicating the string n-times

Fixed Scope Task Evaluation - Learned Metrics

Finetune a model directly on scores from human evaluations to perform evaluation

BLEURT

Finetune a pretrained BERT model on synthetic tasks with perturbed data and automatic metrics. Then finetune again with human evaluation metrics.

COMET

SOTA LLM-as-a-judge for similarity scores

Open Ended Task Evaluations - Human Evaluations

Challenges when using humans

Subjective
Needs experts to evaluate
Annotators might not do a good job

Long Eval

Split long form text into atomic claims
Get each claim verified for support from the long form text
- (optional) Send only subset of claims to each individual annotator to reduce workload
Calculate the percentage of claims supported by the text

Open Ended Task Evaluations - LLM Evaluations

GPTEval

Create a prefix/context for LLM with instructions on how to evaluate and to give a score
Aggregate scores w/ probabilities to calculate evaluation
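
One way to read "aggregate scores with probabilities" is as an expected value over the possible scores. The probabilities below are made up for illustration and `expected_score` is a hypothetical helper name.

```python
def expected_score(score_probs):
    """Probability-weighted average of the candidate scores."""
    total = sum(score_probs.values())  # normalise in case they do not sum to 1
    return sum(score * p for score, p in score_probs.items()) / total


# Hypothetical probabilities an LLM assigns to each score on a 1-5 scale.
score_probs = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.40, 5: 0.25}
score = expected_score(score_probs)
```

This gives a finer-grained number (3.7 here) than simply taking the most likely score (4).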

Win Rate

Ask LLM to select one of two outputs and use that as a score
One of the two outputs should ideally come from the same base model to ensure fair comparison between two models
Caveat: The selected annotator model might prefer a specific class of model (responses created by OpenAI models may be preferred by OpenAI models)

Decompose, Eval and Aggregate

Use an LLM to break down a text into claims
Verify each claim with an LLM + an evidence source
Calculate the retrieval score

Resources

Dr. Mohit Iyyer’s UMass CS685 S24 Lecture #15
Chatbot Arena

AI - Adversarial Search

2024-04-01T00:00:00+00:00

In a nutshell

Key Ideas

Notes

Topic 1

Misc

Needs Exploration

Resources

AI - Constraint Satisfaction Problems

2024-04-01T00:00:00+00:00

In a nutshell

Key Ideas

Notes

Constraint Satisfaction Problems (CSPs)

Basic Properties

Variables (X = X_1, X_2, X_3, ..., X_n) are all linear rational values
- X_i belongs to domain D_i
Constraints (C) are all linear
- Constraints list which variables are involved and how
Effective solvers reduce search space significantly and quickly w/ use of variable dependencies
Objective: Find a legal assignment of values (y = y_1, y_2, y_3, ..., y_n) to variables such that all constraints are satisfied
- Complete: All variables are set
- Consistent: No constraint is violated
States are partial assignments of the variables
Can be encoded as a Constraint Graph where Nodes are variables, Edges are constraints

An example
Variables: X = {WA, NT, Q, NSW, V, SA, T}
Domains: D_i = {R, G, B}
Constraints: If (X_i, X_j) in edges (E), then color(X_i) =/= color(X_j)

Graph coloring of the territories in Australia, with no adjacent territory sharing the same color
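
The example can be solved with a plain backtracking search, sketched below. The adjacency list is my own encoding of the map's borders, and the search uses no ordering heuristics or constraint propagation, so it is a baseline rather than an effective solver.

```python
# Constraint graph: nodes are variables, edges are "must differ" constraints.
NEIGHBOURS = {
    "WA": ["NT", "SA"],
    "NT": ["WA", "SA", "Q"],
    "SA": ["WA", "NT", "Q", "NSW", "V"],
    "Q": ["NT", "SA", "NSW"],
    "NSW": ["Q", "SA", "V"],
    "V": ["SA", "NSW"],
    "T": [],
}
COLOURS = ["R", "G", "B"]


def consistent(var, colour, assignment):
    """True if no already-assigned neighbour of var has this colour."""
    return all(assignment.get(n) != colour for n in NEIGHBOURS[var])


def backtrack(assignment):
    """Extend a partial assignment (a state) until it is complete, or fail."""
    if len(assignment) == len(NEIGHBOURS):
        return assignment  # complete and consistent
    var = next(v for v in NEIGHBOURS if v not in assignment)
    for colour in COLOURS:
        if consistent(var, colour, assignment):
            result = backtrack({**assignment, var: colour})
            if result is not None:
                return result
    return None  # no colour works here: undo and try another branch


solution = backtrack({})
```

Each recursive call works on a partial assignment, matching the notion of a state above, and returns only when the assignment is both complete and consistent.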

Variations

Variable type
- Discrete
  - Generally considered computationally intractable problems
- Continuous
  - Generally considered easier
  - linear programming problems are solvable in polynomial time
Domain type
- Finite Domains: e.g. 8-queens
- Infinite Domains: e.g. Job-Shop Scheduling
Constraint type
- Unary: One variable e.g. SA =/= G
- Binary: Two variables e.g. SA =/= WA
- Global (higher order): 3 or more variables e.g. X_1 + X_2 - 4*X_7 <= 15
