classifier.fit_generator(training_set,
                         steps_per_epoch=8000,
                         epochs=iEpochs,
                         validation_data=test_set,
                         validation_steps=2000,
                         callbacks=[tbCallBack])  # Last param is a late addition
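(For reference, tbCallBack is just a standard Keras TensorBoard callback, set up along these lines - the log directory is whatever you fancy, and this isn't necessarily the exact line from my script:)

from keras.callbacks import TensorBoard

# Write TensorBoard logs during training; log_dir here is just an example path
tbCallBack = TensorBoard(log_dir='./logs', write_graph=True)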
Running that fit_generator call gives this:
Epoch 1/2
2019-03-29 12:33:09.921019: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-03-29 12:33:09.921056: E tensorflow/stream_executor/cuda/cuda_dnn.cc:337] Possibly insufficient driver version: 410.104.0
2019-03-29 12:33:09.921062: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-03-29 12:33:09.921070: E tensorflow/stream_executor/cuda/cuda_dnn.cc:337] Possibly insufficient driver version: 410.104.0
And a stacktrace, and then this
UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_2/convolution}}]]
[[loss_1/mul/_185]]
So - my job for today is working out, like, what the heck to try next. As it were.
(Yes - it is only two epochs - that's just while I'm checking that it works; once it's running properly that goes way up, which is why I've parameterised the number.)
Let's start looking here:
https://github.com/tensorflow/tensorflow/issues/24828
There seems to be a suggestion to add:
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
I'm using TF through Keras, so I'm not sure this will help, but let's try.
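Since I'm going via Keras, my take on the same idea is roughly this (TF 1.x style; my adaptation, not the exact snippet from the issue thread):

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

# allow_growth = claim GPU memory as needed, rather than grabbing it all up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))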
Still doesn't work.
(p36TFGJT) jonathan@Wednesday:~$ nvidia-smi
Fri Mar 29 12:58:16 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   46C    P8    17W / 151W |   1280MiB /  8116MiB |     21%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1289      G   /usr/lib/xorg/Xorg                            26MiB |
|    0      1331      G   /usr/bin/gnome-shell                          51MiB |
|    0      1602      G   /usr/lib/xorg/Xorg                           497MiB |
|    0      1719      G   /usr/bin/gnome-shell                         203MiB |
|    0      3783      G   ...equest-channel-token=798233842220658081   197MiB |
|    0      5478      G   /usr/bin/nvidia-settings                       0MiB |
|    0      5925      G   ...-token=A5215F1CE4347817C36139407E5E1125    58MiB |
|    0     11552      G   ...than/anaconda3/envs/p36TFGJT/bin/python     3MiB |
|    0     11675      C   ...than/anaconda3/envs/p36TFGJT/bin/python   235MiB |
+-----------------------------------------------------------------------------+
(p36TFGJT) jonathan@Wednesday:~$ dpkg -l | grep -i cudnn
ii  libcudnn7      7.5.0.56-1+cuda10.0  amd64  cuDNN runtime libraries
ii  libcudnn7-dev  7.5.0.56-1+cuda10.0  amd64  cuDNN development libraries and headers
(p36TFGJT) jonathan@Wednesday:~$
In case that's of interest (it kind of is to me).
This dude has what sounds like the same problem as me:
https://github.com/tensorflow/tensorflow/issues/22056#issuecomment-470749095
https://github.com/tensorflow/tensorflow/issues/22056#issuecomment-471091775
So - the solution there was:
I found this, after all, to work for me:
- From Software & Updates > Additional Drivers > choose nvidia-418 (downloads and installs it)
- Reboot the PC
As a result, got upgraded to cuda-10.1 (from 10.0). It works for now!
I'm not sure I want to try this though... First off, let's do a full backup (using Deja Dup; Back In Time didn't really work out for me).
But let's do it. The backup completed, the update ran, and I'm now on the 418 drivers.
Restart, run GPU tests. Close / reopen Spyder.
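(The "GPU tests" are nothing formal - just a quick sanity check along these lines to confirm TensorFlow can actually see the card:)

import tensorflow as tf
from tensorflow.python.client import device_lib

# Should print True, and a device list that includes '/device:GPU:0'
print(tf.test.is_gpu_available())
print([d.name for d in device_lib.list_local_devices()])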
Results are in:
Using TensorFlow backend.
2019-03-29 14:32:24.335488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1674] Adding visible gpu devices: 0
2019-03-29 14:32:24.335517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1082] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-29 14:32:24.335521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1088] 0
2019-03-29 14:32:24.335524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1101] 0: N
2019-03-29 14:32:24.335761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1222] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7262 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
About to start model training
At: (2019, 3, 29, 14, 32)
Epoch 1/2
2019-03-29 14:32:32.916800: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-03-29 14:32:33.184104: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
8000/8000 [==============================] - 1009s 126ms/step - loss: 0.4253 - acc: 0.7970 - val_loss: 0.6266 - val_acc: 0.7519
Epoch 2/2
8000/8000 [==============================] - 1012s 126ms/step - loss: 0.1642 - acc: 0.9349 - val_loss: 1.0975 - val_acc: 0.7356
Finished in 2028.7929067611694
Finished at: (2019, 3, 29, 15, 6)
Model time was 2021.7227563858032
Yay, it ran on the GPU.
Hmm, it was far slower than my CPU. Time to ponder what's up.
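(For the record, the "Finished in" / "Model time" numbers above come from a very basic timing wrapper around the fit call - roughly this shape, with the variable names being my reconstruction rather than the exact script:)

import time

tStart = time.time()

print("About to start model training")
print("At:", time.localtime()[:5])   # (year, month, day, hour, minute)

# Time the training itself separately from the overall run
tModelStart = time.time()
classifier.fit_generator(training_set, steps_per_epoch=8000, epochs=iEpochs,
                         validation_data=test_set, validation_steps=2000,
                         callbacks=[tbCallBack])
tModelEnd = time.time()

print("Finished in", time.time() - tStart)
print("Finished at:", time.localtime()[:5])
print("Model time was", tModelEnd - tModelStart)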