classifier.fit_generator(training_set,
                         steps_per_epoch=8000,
                         epochs=iEpochs,
                         validation_data=test_set,
                         validation_steps=2000,
                         callbacks=[tbCallBack])  # Last param is a late addition
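(For reference, tbCallBack is just a standard Keras TensorBoard callback, set up along these lines - the log directory is whatever you fancy, and this isn't necessarily the exact line from my script:)

from keras.callbacks import TensorBoard

# Write TensorBoard logs during training; log_dir here is just an example path
tbCallBack = TensorBoard(log_dir='./logs', write_graph=True)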
Running that fit_generator call gives this:
Epoch 1/2
2019-03-29 12:33:09.921019: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-03-29 12:33:09.921056: E tensorflow/stream_executor/cuda/cuda_dnn.cc:337] Possibly insufficient driver version: 410.104.0
2019-03-29 12:33:09.921062: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-03-29 12:33:09.921070: E tensorflow/stream_executor/cuda/cuda_dnn.cc:337] Possibly insufficient driver version: 410.104.0
And a stacktrace, and then this
UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_2/convolution}}]]
[[loss_1/mul/_185]]
So - my job for today is working out, like, what the heck to try next. As it were.
(Yes - it is only two epochs - that's just while I'm checking that it works; once it's running properly that goes way up, which is why I've parameterised the number.)
Let's start looking here:
https://github.com/tensorflow/tensorflow/issues/24828
There seems to be a suggestion to add:
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
I'm using TF through Keras, so I'm not sure this will help, but let's try.
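Since I'm going via Keras, my take on the same idea is roughly this (TF 1.x style; my adaptation, not the exact snippet from the issue thread):

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

# allow_growth = claim GPU memory as needed, rather than grabbing it all up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))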
Still doesn't work.
(p36TFGJT) jonathan@Wednesday:~$ nvidia-smi
Fri Mar 29 12:58:16 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   46C    P8    17W / 151W |   1280MiB /  8116MiB |     21%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1289      G   /usr/lib/xorg/Xorg                            26MiB |
|    0      1331      G   /usr/bin/gnome-shell                          51MiB |
|    0      1602      G   /usr/lib/xorg/Xorg                           497MiB |
|    0      1719      G   /usr/bin/gnome-shell                         203MiB |
|    0      3783      G   ...equest-channel-token=798233842220658081   197MiB |
|    0      5478      G   /usr/bin/nvidia-settings                       0MiB |
|    0      5925      G   ...-token=A5215F1CE4347817C36139407E5E1125    58MiB |
|    0     11552      G   ...than/anaconda3/envs/p36TFGJT/bin/python     3MiB |
|    0     11675      C   ...than/anaconda3/envs/p36TFGJT/bin/python   235MiB |
+-----------------------------------------------------------------------------+
(p36TFGJT) jonathan@Wednesday:~$ dpkg -l | grep -i cudnn
ii  libcudnn7      7.5.0.56-1+cuda10.0  amd64  cuDNN runtime libraries
ii  libcudnn7-dev  7.5.0.56-1+cuda10.0  amd64  cuDNN development libraries and headers
(p36TFGJT) jonathan@Wednesday:~$
In case that's of interest (it kind of is to me).
This dude has what sounds like the same problem as me:
https://github.com/tensorflow/tensorflow/issues/22056#issuecomment-470749095
https://github.com/tensorflow/tensorflow/issues/22056#issuecomment-471091775
So - the solution there was:
I found this, after all, to work for me:
- From Software & Updates > Additional Drivers > choose nvidia-418 (downloads and installs it)
- Reboot the PC
As a result, got upgraded to cuda-10.1 (from 10.0). It works for now!
I'm not sure I want to try this though... First off, let's do a full backup (using Deja Dup; Back In Time didn't really work out for me).
But let's do it. The backup completed, the update ran, and I'm now on the 418 drivers.
Restart, run GPU tests. Close / reopen Spyder.
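(The "GPU tests" are nothing formal - just a quick sanity check along these lines to confirm TensorFlow can actually see the card:)

import tensorflow as tf
from tensorflow.python.client import device_lib

# Should print True, and a device list that includes '/device:GPU:0'
print(tf.test.is_gpu_available())
print([d.name for d in device_lib.list_local_devices()])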
Results are in:
Using TensorFlow backend.
2019-03-29 14:32:24.335488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1674] Adding visible gpu devices: 0
2019-03-29 14:32:24.335517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1082] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-29 14:32:24.335521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1088] 0
2019-03-29 14:32:24.335524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1101] 0: N
2019-03-29 14:32:24.335761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1222] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7262 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
About to start model training
At: (2019, 3, 29, 14, 32)
Epoch 1/2
2019-03-29 14:32:32.916800: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-03-29 14:32:33.184104: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
8000/8000 [==============================] - 1009s 126ms/step - loss: 0.4253 - acc: 0.7970 - val_loss: 0.6266 - val_acc: 0.7519
Epoch 2/2
8000/8000 [==============================] - 1012s 126ms/step - loss: 0.1642 - acc: 0.9349 - val_loss: 1.0975 - val_acc: 0.7356
Finished in 2028.7929067611694
Finished at: (2019, 3, 29, 15, 6)
Model time was 2021.7227563858032
Yay, it ran on the GPU.
Hmm, it was far slower than my CPU. Time to ponder what's up.
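(For the record, the "Finished in" / "Model time" numbers above come from a very basic timing wrapper around the fit call - roughly this shape, with the variable names being my reconstruction rather than the exact script:)

import time

tStart = time.time()

print("About to start model training")
print("At:", time.localtime()[:5])   # (year, month, day, hour, minute)

# Time the training itself separately from the overall run
tModelStart = time.time()
classifier.fit_generator(training_set, steps_per_epoch=8000, epochs=iEpochs,
                         validation_data=test_set, validation_steps=2000,
                         callbacks=[tbCallBack])
tModelEnd = time.time()

print("Finished in", time.time() - tStart)
print("Finished at:", time.localtime()[:5])
print("Model time was", tModelEnd - tModelStart)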