30 March 2019

And we (may) have lift off / Houston we have a problem.

When last I wrote (last night), I'd run a successful tensorflow gpu build. I'd installed it into a conda environment.

And on running it I had a driver version mismatch: a 418 gpu driver, with 410 used everywhere else.

I didn't really look to fix that last night. I made a half-hearted attempt to install a default version again, just because I now knew I had all the supporting files, drivers, libraries.

None of that worked, obviously.

So I ate some junk food, watched some Expanse ("I'm that guy"!), wood working videos, went to bed. Got some sleep.

It's morning. Last workday of my week's vacation. And I have some stuff to do today that isn't computer or youtube-watching related.

I fired up my machine, ensured I had the same error still, deleted some shoddy conda environments, did a little clean-up on the system, and thought about it.

I need to fix the driver issue. Let's hit a search engine with the "failed call to cuInit: CUDA_ERROR_SYSTEM_DRIVER_MISMATCH: system has unsupported display driver / cuda driver combination" error, and see what I see.

Going through the first hit (https://github.com/tensorflow/tensorflow/issues/19266), just following along for s&g, the responder asks for nvidia-smi output. Hmm, I wonder what that currently shows on my machine.

I run it, and... it errors. Wrong driver version. Oh FFS! Good start. Looks like some of my system is now 410 and some 418 (rather than just having some libs compiled against a different version).
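
(For the record, here's roughly how I'd check which version lives where. A sketch only: it assumes a Linux box with the NVIDIA driver loaded, and neither command is lifted from this session.)

import subprocess

# Kernel side: the version of the nvidia module that's actually loaded right now.
with open("/proc/driver/nvidia/version") as f:
    print(f.readline().strip())

# Userspace side: what nvidia-smi reports for the driver.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    stdout=subprocess.PIPE, universal_newlines=True)
print(result.stdout.strip())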

Try reinstalling the 418 driver from nvidia. That errors out (a driver-in-use error).

So I decide I may as well power off / power on, just to see if my machine actually still works.

I do, it does. Yay.

Run nvidia-smi and...
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   49C    P0    41W / 151W |    367MiB /  8116MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1289      G   /usr/lib/xorg/Xorg                            26MiB |
|    0      1331      G   /usr/bin/gnome-shell                          51MiB |
|    0      1602      G   /usr/lib/xorg/Xorg                           193MiB |
|    0      1719      G   /usr/bin/gnome-shell                          93MiB |
+-----------------------------------------------------------------------------+



So - it's using 410 now. Open the nvidia server settings GUI. Driver version 410.
My nvidia drivers now all appear to be 410. Interesting.

Let's fire up a clean Python 3.6 environment, install my tensorflow gpu build, and see what happens!


import tensorflow as tf

tf.test.is_gpu_available(
    cuda_only=False,
    min_cuda_compute_capability=None
)

Which outputs:

2019-03-29 10:54:29.739032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1674] Adding visible gpu devices: 0
2019-03-29 10:54:29.739060: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1082] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-29 10:54:29.739064: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1088]      0
2019-03-29 10:54:29.739067: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1101] 0:   N
2019-03-29 10:54:29.739136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1222] Created TensorFlow device (/device:GPU:0 with 7238 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)

Out[8]: True




My, that looks interesting. It seems to be gpu'ing.
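
Just to convince myself it's really executing on the GPU rather than just seeing it, this is the kind of quick check I'd run next (a sketch in the TF 1.x style this build uses, not my actual test script):

import tensorflow as tf

# Pin a tiny matmul to the GPU and log where ops actually get placed.
with tf.device("/device:GPU:0"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))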

I can't run my cnn script yet, as I don't have keras in that env. Let's get myself up and running, and see if I can execute that script without errors!


conda install -c conda-forge keras

The install plan does, however, include the following:
The following NEW packages will be INSTALLED:

  <snip>
  tensorflow         conda-forge/linux-64::tensorflow-1.1.0-py36_0
  <snip>

So - I may have some further work to do...

Yeah, everything fails horribly. Let's try installing my own tensorflow gpu build over the top. Just as well I copied it, as tmp has deleted the version that was there. So - where did I copy it to? Time to go hunting, I guess. Note to self: get some order around where I store self-built tensorflow.

Bugger, bugger, bugger. When I run pip it just reports I already have tf, and does nothing.
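
At this point it's worth checking which tensorflow the environment will actually import, since the conda-forge build and my own wheel are fighting over the same name. A quick sketch using standard module attributes, nothing specific to my setup:

import tensorflow as tf

# Which build is live in this env, and where on disk does it come from?
print(tf.__version__)
print(tf.__file__)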

Time to try --ignore-installed (on a cloned environment, obviously)

pip install --ignore-installed BackupTF/gpu_wheel/tf_nightly-1.13.1-cp36-cp36m-linux_x86_64.whl

Which in turn did a lot of churn on some things I know can be a bit... stroppy (protobuf, for example; I've had version issues with it before).
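
If protobuf does turn out to be the stroppy one again, the first thing I'd check is which version the churn actually left behind (just a habit of mine, not something from this session):

# Confirm what the reinstall left behind for protobuf.
import google.protobuf
print(google.protobuf.__version__)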

Well, it did run, so let's try it. The tensorflow-gpu test script I have seems to be happy. Yay.

My cnn script fails with a PIL.Image import issue (which I've seen recently anyway). Let's go and fix that.

Just because I can't resist it, anger being an energy and all that:



pip install pillow

Conda env switch (to force an activate in the env I'm in). My tf gpu tests still pass.


But my cnn script fails with the following. Still, it's progress!
UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_1/convolution}}]]
[[loss/mul/_91]]


Let's try and fix that next, shall we?
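
For future me: the usual suspect for that particular cuDNN error is cuDNN failing to grab enough GPU memory at initialisation, and the commonly suggested workaround is to let TensorFlow allocate GPU memory on demand rather than all up front. Something along these lines, untested here yet and assuming the script drives Keras on the TF backend:

import tensorflow as tf
from keras import backend as K

# Let TensorFlow grow its GPU memory use on demand, which often
# lets cuDNN initialise cleanly instead of failing at startup.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))

# ...then build and fit the CNN as before.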









