"Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging"
https://arxiv.org/html/2406.16330v1
arXiv:2406.16330v1 [cs.CL] 24 Jun 2024
Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging. A most interesting article by Deyuan Liu et al.
This reminded me of a great article by Chris Olah, “Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases” (an informal note on some intuitions related to Mechanistic Interpretability).
https://transformer-circuits.pub/2022/mech-interp-essay/index.html
Aka the Curse of Dimensionality
As a personal stance, I would posit that it is not a curse but a ‘dimensional variant’, although our technology and our brains are not yet up to speed to take advantage of it.
A Quick Definition
Dimensionality in the context of data science and machine learning refers to the number of features or attributes that describe each data point in a dataset. More formally:
Definition: In a dataset D, if each data point x is represented as a vector x = (x₁, x₂, ..., xₙ), where n is the number of features, then n is the dimensionality of the data.
For example:
- In a dataset of houses, dimensions might include square footage, number of bedrooms, price, location coordinates, etc.
- In image processing, each pixel could be considered a dimension, so a 100x100 pixel grayscale image would have 10,000 dimensions.
- In natural language processing, when using techniques like word embeddings, each word might be represented by a vector of several hundred dimensions.
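To make the definition concrete, here is a minimal sketch (in Python; the house values are made up and purely illustrative) showing that dimensionality is just the length of each data point's feature vector:

```python
import numpy as np

# Toy "houses" dataset: each row is one house, each column one feature
# (square footage, bedrooms, price, latitude, longitude) -- values are made up.
houses = np.array([
    [1800, 3, 350_000, 40.71, -74.00],
    [2400, 4, 520_000, 34.05, -118.24],
    [950, 2, 210_000, 41.88, -87.63],
])
print(houses.shape)              # (3, 5): 3 data points, dimensionality n = 5

# A 100x100 grayscale image treated as one data point: flatten it to a vector.
image = np.random.rand(100, 100)
print(image.reshape(-1).shape)   # (10000,): 10,000 dimensions
```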
When it comes to a dataset of words, the concept of dimensions is quite different from the house example.
1. One-Hot Encoding: In the simplest form, each word could be represented by a vector with a length equal to the vocabulary size, where only one element is 1 (corresponding to that word) and all others are 0. However, this is rarely used in modern NLP due to its inefficiency. (A small sketch contrasting this with dense embeddings follows this list.)
2. Word Embeddings: More commonly, words are represented by dense vectors. In this case:
- Each dimension doesn't correspond to a specific, interpretable feature like "number of bedrooms."
- Instead, each dimension contributes to representing semantic and syntactic properties of the word.
- The values in these dimensions are learned by the model during training.
3. Contextual Embeddings: In modern language models:
- Words don't have fixed representations.
- The embedding of a word changes based on its context in the sentence.
- Each dimension in these embeddings represents complex, learned features that capture various aspects of the word's meaning and usage in that specific context.
4. Subword Tokenization: Many modern models don't work with whole words, but with subword tokens. Each token is then represented by a high-dimensional vector.
5. Model Internal Representations: As the input moves through the layers of a neural network, the representations become increasingly abstract and task-specific.
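To make points 1 and 2 above concrete, here is a minimal sketch with a toy five-word vocabulary and a randomly initialised embedding table (both my own assumptions, purely for illustration):

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]           # toy vocabulary (my assumption)
word_to_id = {w: i for i, w in enumerate(vocab)}

# 1. One-hot: dimensionality equals vocabulary size; a single 1, everything else 0.
one_hot = np.eye(len(vocab))[word_to_id["cat"]]
print(one_hot)                                       # [0. 1. 0. 0. 0.]

# 2. Dense embedding: dimensionality is chosen by the model designer (8 here);
#    real values are learned during training -- random numbers stand in for them.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))
dense_vector = embedding_table[word_to_id["cat"]]
print(dense_vector.shape)                            # (8,) -- no axis means "bedrooms"
```

The point is only that the one-hot vector's dimensionality is tied to vocabulary size, while the dense vector's dimensionality is a design choice whose individual axes carry no nameable meaning.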
So, for a dataset of words processed by a modern language model:
- There might be an initial embedding layer with X dimensions. Each of these X dimensions doesn't represent a specific, nameable attribute of the word.
- Instead, they collectively represent the word's meaning and properties in a way that's optimized for the model's language understanding tasks.
- These representations are then transformed multiple times as they pass through the model's layers.
This is fundamentally different from the house example, where each dimension had a clear, interpretable meaning. In word embeddings, the meaning is distributed across all dimensions in complex, often non-interpretable ways.
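Points 3 and 4 above can be seen directly with an off-the-shelf model. A small sketch, assuming the Hugging Face transformers library and the distilbert-base-uncased checkpoint (my choice of tooling, not anything from the paper): the same surface word gets a different vector in each sentence, and a longer word is split into subword tokens.

```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "distilbert-base-uncased"        # small public checkpoint, chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def vector_for(sentence, word):
    """Contextual vector of the first occurrence of the token `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]            # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

a = vector_for("I deposited cash at the bank.", "bank")
b = vector_for("We picnicked on the river bank.", "bank")
print(torch.cosine_similarity(a, b, dim=0).item())   # below 1.0: same word, different vectors

# Subword tokenization (point 4): an uncommon word is split into pieces.
print(tokenizer.tokenize("dimensionality"))          # e.g. ['dimensional', '##ity']
```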
It's important to note that dimensionality can refer to:
1. Original dimensionality: The number of features in the raw data.
2. Intrinsic dimensionality: The actual number of dimensions needed to represent the underlying structure of the data, which may be lower than the original dimensionality.
When discussing the "curse of dimensionality," we're often referring to scenarios where the original dimensionality is high, potentially much higher than the intrinsic dimensionality or the number of samples in the dataset.
Understanding this definition helps clarify why certain challenges arise as dimensionality increases, and why techniques like dimensionality reduction can be effective in many cases.
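As an illustrative sketch of that distinction (synthetic data and scikit-learn's PCA, my own example, not from the paper): 50 observed dimensions generated from only 3 underlying factors, where PCA recovers the low intrinsic dimensionality.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 3))                   # intrinsic dimensionality: 3
mixing = rng.normal(size=(3, 50))
observed = latent @ mixing + 0.01 * rng.normal(size=(1000, 50))  # original dimensionality: 50

pca = PCA().fit(observed)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(int(np.argmax(cumulative > 0.99)) + 1)          # ~3 components explain almost everything
```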
So, to recap: in the house example each dimension had a clear, interpretable meaning, whereas in word embeddings (the word vectors) the meaning is distributed across all dimensions in complex, often non-interpretable ways. Ipso facto, a word's or token's semantic and syntactic information is essentially ‘encoded’ in code 🙂
But there is a problem:
Complexity: The relationships between words are encoded in complex, high-dimensional spaces, allowing for rich representations of linguistic nuances that aren't easily reducible to simple, interpretable features, nor even readily interpretable or understandable by humans, since this processing happens in milliseconds and in parallel.
This complexity suggests that a model's capabilities may be emergent properties of the system, rather than explicitly programmed behaviors.
So in reality we're dealing with a form of information processing that operates on principles that may be fundamentally different from human cognition, both in terms of speed and the nature of the representations involved.
This perspective may be crucial for understanding the current state and future implications of AI language models, rather than running away to hide behind a hill of myth.
We need to face the true nature of AI systems head-on, rather than retreating into comfortable but inaccurate narratives. All the myth-based scaremongering is about control, monetization, and the perpetuation of the Government inspecting the inspectors.
It is most disappointing that world-class intellects are forgoing their intellectual honesty in the imposition and promulgation of myth. I'm only a regular user, but it does seem to call for a direct, unvarnished examination of AI systems, eschewing comforting fictions.
I'm so sick of the regurgitation of ‘ethical obligation’; surely it is the responsibility of ‘Experts’, and the ethical obligation of those with expertise, to provide accurate information and analysis rather than pandering to sensationalist end-of-the-world BS.
We have reached the most unenviable situation where we see an erosion of public trust in AI due to hyperbolic or inaccurate statements from supposed experts. Those afraid of the shadow of what they helped create now wield ‘words of shadow’, ironically.
Yes, there are complex emotions and reactions, as with the invention of fire: run away afraid if you want, or throw unsubstantive, myth-based stones from the sidelines.
Anyway, Back to Dimensionality
1. Redundancy as a Resource: The ability to compress models significantly while maintaining performance suggests that the high dimensionality of LLMs contains redundant information. This redundancy, far from being a curse, is potentially a valuable resource that can be leveraged for efficiency.
2. Compressibility: The success of Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA) in compressing models indicates that the high-dimensional representations in LLMs are highly compressible. This challenges the notion that high dimensionality is inherently problematic.
3. Information Density: The fact that performance can be largely maintained after significant compression suggests that the essential information in these models is more densely packed than previously thought.
4. Architectural Flexibility: The ability to merge layers based on manifold learning implies that the high-dimensional architecture of LLMs is more flexible and adaptable than it might appear.
5. Efficiency Potential: Rather than being a curse, high dimensionality in LLMs might be viewed as untapped potential for efficiency improvements.
6. Knowledge Alignment: The concept of aligning knowledge across layers suggests that high dimensionality allows for distributed representation of information, which can be realigned and compressed.
In essence, this research suggests that what we might have perceived as a "curse" of dimensionality in LLMs is actually a rich space of possibilities for optimization and efficiency. The high dimensionality provides a form of over-parameterization that, when properly leveraged, allows for powerful compression techniques without significant loss of capability.
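For intuition only, here is a toy sketch of the general "measure similarity, then merge" idea. This is emphatically not the authors' MKA algorithm (which aligns layer representations via manifold learning); it just averages the weights of the two most output-similar adjacent layers in a made-up stack of linear layers:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "model": a stack of 6 linear layers, one of which is nearly a copy of its
# neighbour -- a stand-in for the redundancy the paper exploits.
num_layers, dim = 6, 32
layers = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(num_layers)]
layers[3] = layers[2] + 0.01 * rng.normal(size=(dim, dim))

def layer_output(weight, x):
    return np.tanh(x @ weight)

# Score adjacent layers by the cosine similarity of their outputs on random probe inputs.
probes = rng.normal(size=(128, dim))
scores = []
for i in range(num_layers - 1):
    a = layer_output(layers[i], probes).ravel()
    b = layer_output(layers[i + 1], probes).ravel()
    scores.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))

# Merge the most similar adjacent pair by simple weight averaging (a crude stand-in
# for MKA's manifold-alignment-based merging).
i = int(np.argmax(scores))
layers = layers[:i] + [0.5 * (layers[i] + layers[i + 1])] + layers[i + 2:]
print(f"merged layers {i} and {i + 1}; the stack now has {len(layers)} layers")
```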
I posit that dimensionality in LLMs is not a curse at all, but rather a feature that provides robustness, flexibility, and potential for optimization. It's a space of opportunity that we are only just beginning to explore.
Kudos to Deyuan Liu et al!