I am currently developing a text classification tool using Keras. It works fine (I got up to 98.7% validation accuracy), but I can’t wrap my head around how exactly a 1D-convolution layer works with text data.
What hyper-parameters should I use?
I have the following sentences (input data):
- Maximum words in a sentence: 951 (shorter sentences are padded to this length)
- Vocabulary size: ~32000
- Number of sentences (for training): 9800
- embedding_vecor_length: 32 (the dimensionality of each word’s embedding vector)
- batch_size: 37 (it doesn’t matter for this question)
- Number of labels (classes): 4
It’s a very simple model (I have tried more complicated structures but, strangely, this one works better – even without using LSTM):
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(labels_count, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
My main question is: What hyper-parameters should I use for Conv1D layer?
model.add(Conv1D(filters=32, kernel_size=2, padding='same', activation='relu'))
If I have following input data:
- Max word count: 951
- Word-embeddings dimension: 32
Does it mean that filters=32 will only scan the first 32 words, completely discarding the rest (with kernel_size=2)? And should I set filters to 951 (the maximum number of words in a sentence)?
Examples in images:
So, for instance, this is the input data: http://joxi.ru/krDGDBBiEByPJA
This is the first step of the convolution layer (stride 2): http://joxi.ru/Y2LB099C9dWkOr
This is the second step (stride 2): http://joxi.ru/brRG699iJ3Ra1m
And filters = 32 means the layer repeats this 32 times? Am I correct?
So I would never get to, say, the 156-th word in the sentence, and this information would be lost?
I will try to explain how 1D convolution is applied to sequence data. I use the example of a sentence consisting of words, but obviously it is not specific to text data; it is the same for other sequence data and time series.
Suppose we have a sentence consisting of m words, where each word has been represented using word embeddings:
Now we would like to apply a 1D convolution layer consisting of n different filters with kernel size k on this data. To do so, sliding windows of length k are extracted from the data, and then each filter is applied to each of those extracted windows. Here is an illustration of what happens (here I have assumed k=3 and removed the bias parameter of each filter for simplicity):
As you can see in the figure above, the response of each filter is equivalent to the result of its convolution (i.e. element-wise multiplication and then summing all the results) with an extracted window of length k (where the i-th window consists of the i-th to (i+k-1)-th words in the given sentence). Further, note that each filter has the same number of channels as the number of features (i.e. the word-embedding dimension) of the training sample (hence performing convolution, i.e. element-wise multiplication, is possible). Essentially, each filter is detecting the presence of a particular feature or pattern in a local window of the training data (e.g. whether a couple of specific words exist in this window or not). After all the filters have been applied to all the windows of length k, we would have an output like this, which is the result of the convolution:
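To make the single-filter arithmetic concrete, here is a minimal NumPy sketch (toy sizes of my own choosing, not the ones from the question): slide a window of k words over the sentence, multiply element-wise with the k x emb_dim filter weights, and sum.

```python
import numpy as np

# Toy sizes (assumptions for illustration, not from the question)
m, emb_dim, k = 6, 4, 3
rng = np.random.default_rng(0)
x = rng.standard_normal((m, emb_dim))  # one "sentence" of m word embeddings
w = rng.standard_normal((k, emb_dim))  # one filter: k x emb_dim weights (bias omitted)

# Response of this single filter: one number per window of k consecutive words
response = np.array([np.sum(x[i:i + k] * w) for i in range(m - k + 1)])
print(response.shape)  # (m - k + 1,) = (4,)
```

With n such filters, stacking the n responses column-wise gives an output of shape (m-k+1, n), as in the figure.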
As you can see, there are m-k+1 windows in the figure, since we have assumed stride=1 (the default behavior of the Conv1D layer in Keras). The stride argument determines how much the window should slide (i.e. shift) to extract the next window (e.g. in our example above, a stride of 2 would instead extract windows of words (1,2,3), (3,4,5), (5,6,7), ...). The padding argument determines whether the windows should consist entirely of words in the training sample or whether there should be padding at the beginning and the end; with padding, the convolution response may have the same length as the training sample (i.e. m and not m-k+1). For example, in our example above, padding='same' would extract windows of words (PAD,1,2), (1,2,3), (2,3,4), ..., (m-2,m-1,m), (m-1,m,PAD).
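The window bookkeeping can be sketched in plain Python (toy sizes of my own choosing; note that for odd k this matches Keras's 'same' padding, which pads (k-1)/2 positions on each side):

```python
m, k = 7, 3  # toy sizes: 7 words, kernel size 3 (assumptions for illustration)
words = list(range(1, m + 1))  # 1-based word positions

# 'valid' (no padding) windows with stride 1 and stride 2
valid_s1 = [tuple(words[i:i + k]) for i in range(0, m - k + 1, 1)]
valid_s2 = [tuple(words[i:i + k]) for i in range(0, m - k + 1, 2)]

# 'same' padding with stride 1: pad so that the number of windows equals m
pad = (k - 1) // 2
padded = ['PAD'] * pad + words + ['PAD'] * pad
same_s1 = [tuple(padded[i:i + k]) for i in range(m)]

print(len(valid_s1))            # 5, i.e. m - k + 1 windows
print(valid_s2)                 # [(1, 2, 3), (3, 4, 5), (5, 6, 7)]
print(same_s1[0], same_s1[-1])  # ('PAD', 1, 2) (6, 7, 'PAD')
```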
You can verify some of the things I mentioned using Keras:
from keras import models
from keras import layers

n = 32         # number of filters
m = 20         # number of words in a sentence
k = 3          # kernel size of filters
emb_dim = 100  # embedding dimension

model = models.Sequential()
model.add(layers.Conv1D(n, k, input_shape=(m, emb_dim)))
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv1d_2 (Conv1D)            (None, 18, 32)            9632
=================================================================
Total params: 9,632
Trainable params: 9,632
Non-trainable params: 0
_________________________________________________________________
As you can see, the output of the convolution layer has a shape of (m-k+1, n) = (18, 32), and the number of parameters (i.e. filter weights) in the convolution layer is equal to:
num_filters * (kernel_size * n_features) + one_bias_per_filter
  = n * (k * emb_dim) + n
  = 32 * (3 * 100) + 32
  = 9632
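The same count can be checked with plain arithmetic (and, if you have Keras available, cross-checked against the layer's count_params() method; below is just the arithmetic):

```python
n, k, emb_dim = 32, 3, 100  # the sizes used in the summary above

# each filter has k * emb_dim weights, plus one bias per filter
params = n * (k * emb_dim) + n
print(params)  # 9632 -- matches the Param # column in the summary
```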