Replacing the Hugging Face LlamaDecoderLayer Class With a New LongNet Layer #94
Replies: 1 comment
-
Thanks for posting, but questions about the Hugging Face Transformers library are out of scope for this book. I think this question might be a better fit for the Hugging Face forums.
-
Hey, I hope you are doing great this weekend. I would like to ask you a technical question, please.
I am working on the CodeLlama model, which uses a decoder-only transformer, following the architecture below.
My main task is to replace the decoder-only block, which uses masked self-attention and a KV cache, with my own encoder-only block that uses the dilated attention from LongNet; here is the code it is based on.
I planned to replace the LlamaDecoderLayer block with an encoder-only block. Here is the original decoder-only block used in CodeLlama, and my own version, built by inheriting from the Hugging Face base classes. The following is the process I went through to replace it with the encoder-only block.
Step 1: Inherit from LlamaConfig to add the new configuration parameters used by my encoder model with dilated multi-head attention (a rough sketch follows).
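In simplified form the config subclass looks like this (the names segment_lengths, dilation_rates, and use_flash_attention are placeholders for illustration, not necessarily the exact parameters in my code):

```python
from transformers import LlamaConfig

class DilatedEncoderConfig(LlamaConfig):
    """LlamaConfig extended with (placeholder) dilated-attention settings."""

    def __init__(
        self,
        segment_lengths=(2048, 4096, 8192),  # LongNet-style segment sizes (placeholder values)
        dilation_rates=(1, 2, 4),            # dilation per segment group (placeholder values)
        use_flash_attention=False,           # enable flash_attention_2 only if the GPU supports it
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.segment_lengths = list(segment_lengths)
        self.dilation_rates = list(dilation_rates)
        self.use_flash_attention = use_flash_attention
```

All the standard Llama parameters (hidden_size, num_attention_heads, num_hidden_layers, ...) are still inherited from LlamaConfig and can be overridden through **kwargs.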
Step 2: The only part I want to replace is self_attn. My own multi-head dilated attention follows the LongNet mechanism. In the dilated attention, flash_attention_2 is optional, depending on whether the GPU architecture supports it (e.g., A100 vs. T4). A sketch of the multi-head dilated attention follows.
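Simplified to a single (segment_length, dilation) pair (the real LongNet combines several pairs and per-head offsets so every position is covered, and handles padding), the mechanism looks roughly like this; F.scaled_dot_product_attention dispatches to a flash-attention kernel when the hardware supports one, which is why flash_attention_2 stays optional:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedSelfAttention(nn.Module):
    """Minimal single-(w, r) dilated self-attention sketch (LongNet-style)."""

    def __init__(self, hidden_size, num_heads, segment_len=2048, dilation=2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.segment_len = segment_len
        self.dilation = dilation
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, hidden_states):
        bsz, seq_len, hidden = hidden_states.shape
        w, r = self.segment_len, self.dilation
        assert seq_len % w == 0 and w % r == 0  # a real implementation would pad instead

        def split_heads(x):
            # (bsz, seq, hidden) -> (bsz, heads, seq, head_dim)
            return x.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(hidden_states))
        k = split_heads(self.k_proj(hidden_states))
        v = split_heads(self.v_proj(hidden_states))

        def sparsify(x):
            # split into segments of length w, keep every r-th token
            x = x.reshape(bsz, self.num_heads, seq_len // w, w, self.head_dim)
            return x[:, :, :, ::r, :]

        qs, ks, vs = sparsify(q), sparsify(k), sparsify(v)

        # Bidirectional (unmasked) attention inside each dilated segment.
        out = F.scaled_dot_product_attention(qs, ks, vs, is_causal=False)

        # Scatter the attended tokens back to their original positions.
        full = torch.zeros(bsz, self.num_heads, seq_len // w, w, self.head_dim,
                           device=hidden_states.device, dtype=out.dtype)
        full[:, :, :, ::r, :] = out
        full = full.view(bsz, self.num_heads, seq_len, self.head_dim)

        attn_output = full.transpose(1, 2).reshape(bsz, seq_len, hidden)
        return self.o_proj(attn_output)
```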
To replace the layer itself, I inherit from the Hugging Face base class (see the sketch after this note). Note: as long as is_causal=None, the attention mechanism is not masked, so every token can attend to every other token during the dot-product similarity. The encoder-only model therefore learns a fully bidirectional representation to produce the embedding space of token vectors, instead of the masked attention used by the decoder-only model, which I am not interested in at this point.
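In simplified form the replacement layer looks like this (DilatedSelfAttention and the segment_lengths / dilation_rates fields come from the sketches above; LlamaDecoderLayer's internals vary between transformers versions, so treat the base-class signature as approximate rather than drop-in code):

```python
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

class DilatedEncoderLayer(LlamaDecoderLayer):
    """LlamaDecoderLayer with its self_attn swapped for dilated attention."""

    def __init__(self, config, layer_idx: int):
        super().__init__(config, layer_idx)
        # Replace only the attention block; keep input_layernorm,
        # post_attention_layernorm and the MLP built by the parent class.
        self.self_attn = DilatedSelfAttention(
            hidden_size=config.hidden_size,
            num_heads=config.num_attention_heads,
            segment_len=config.segment_lengths[0],  # placeholder config fields
            dilation=config.dilation_rates[0],
        )

    def forward(self, hidden_states, **kwargs):
        # Same pre-norm residual layout as the Llama layer, but with
        # unmasked (bidirectional) dilated attention and no KV cache.
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states = residual + self.self_attn(hidden_states)

        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = residual + self.mlp(hidden_states)
        return hidden_states  # HF's own LlamaModel may expect a tuple here
```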
Step 4: I reconstructed the model using the adjusted config class; I did the following (a sketch is shown after these notes). Note: I adjusted num_hidden_layers only for the showcase (config.num_hidden_layers = 2; the original parameter is num_hidden_layers=32). Note: I did not use rotary embeddings, because the attention used is linear.
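As a simplified sketch of the reconstruction (LlamaRMSNorm is an internal transformers module; the 2-layer config is only for the showcase, and in practice the config values would come from the CodeLlama checkpoint rather than the Llama defaults):

```python
import torch.nn as nn
from transformers.models.llama.modeling_llama import LlamaRMSNorm

class DilatedEncoderModel(nn.Module):
    """Encoder-only stack built from the adjusted config (no LM head, no rotary embeddings)."""

    def __init__(self, config):
        super().__init__()
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList(
            DilatedEncoderLayer(config, layer_idx=i)
            for i in range(config.num_hidden_layers)
        )
        self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(self, input_ids):
        hidden_states = self.embed_tokens(input_ids)
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return self.norm(hidden_states)  # token embeddings

# Showcase: 2 layers instead of the original 32.
config = DilatedEncoderConfig(num_hidden_layers=2)
encoder = DilatedEncoderModel(config)
```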
Q1: Please correct me if I need to keep the rotary embedding in my encoder-only model.
Final step: Transfer the weights of the ["q_proj", "k_proj", "v_proj", "o_proj"] layers from the decoder-only model to the encoder-only model. Comparing the new encoder-only with the decoder-only:
- Decoder-only, as used in CodeLlama
- Encoder-only: CodeLlama with the adjustments I made
Both have the same linear layers, ["q_proj", "k_proj", "v_proj", "o_proj"]. The code I built to transfer the weights follows.
Please correct me if I have misunderstood anything about transforming CodeLlama into an encoder-only model to learn the embeddings.
Thank you so much in advance.