Playing with machine learning: Getting tensorflow working - Part 3

It's the next morning, I've had some coffee, I've lounged watching some YouTube cat videos, or dashcam footage, or something equally stupid. Why would I want to watch updates on the collapse of UK politics (it's Brexit week...) when I can watch kittens?

Let's try getting GPU support going.

I have all the libraries I need.

I reread the GPU install page: https://www.tensorflow.org/install/gpu

And I decide to try the tensorflow-gpu build that's already available:

pip install tensorflow-gpu

(In a cloned environment)

That seems to go reasonably well. I go through the previous dance of reinstalling keras and deactivating / reactivating environments. I run my script, and I've got everything fixed.

I get various CUDA errors, but my script runs. It runs slower than CPU only. Again, not by much, but slower.

Back to Google, and GitHub issues to trawl through. There are various import os / os.environ items I need to set.

I set them.

Then I find I have them wrong, so I fix them.

And it is still no faster.
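
For reference, the general shape of those os.environ tweaks is sketched below. I didn't write down exactly which variables I ended up with, so CUDA_VISIBLE_DEVICES and TF_CPP_MIN_LOG_LEVEL here are assumptions (they are real variables, just not necessarily the ones I set), and apply_gpu_env is a made-up helper name.

```python
import os

# Assumed settings for illustration, not necessarily the ones used in this post.
GPU_ENV = {
    "CUDA_VISIBLE_DEVICES": "0",   # expose only the first GPU to CUDA
    "TF_CPP_MIN_LOG_LEVEL": "0",   # keep all TensorFlow C++ log messages
}


def apply_gpu_env(env=os.environ, settings=GPU_ENV):
    """Copy the settings into env and return the keys that were changed."""
    changed = []
    for key, value in settings.items():
        if env.get(key) != value:
            env[key] = value
            changed.append(key)
    return changed


# Intended to be called before "import tensorflow"; values set after the
# import may be ignored.
```

Returning the list of changed keys makes it easy to print what was actually altered, which would have saved me at least one lap of "I have them wrong, so I fix them".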

After a few circuits of that particular merry-go-round I am back to thinking I may as well actually try building it. Just fix last night's issues, and build.

I go through the build again, with nothing changed, just to check I get the same issues. Which are:

Starting local Bazel server and connecting to it...
WARNING: The following configs were expanded more than once: [cuda]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
ERROR: Skipping '//tensorflow/tools/pip_package:build_pip_package': error loading package 'tensorflow/tools/pip_package': Encountered error while reading extension file 'cuda/build_defs.bzl': no such package '@local_config_cuda//cuda': Traceback (most recent call last):
File "/<path>/tensorflow/third_party/gpus/cuda_configure.bzl", line 1528
_create_local_cuda_repository(repository_ctx)
File "/<path>/tensorflow/third_party/gpus/cuda_configure.bzl", line 1282, in _create_local_cuda_repository
_find_libs(repository_ctx, cuda_config)
File "/<path>/tensorflow/third_party/gpus/cuda_configure.bzl", line 866, in _find_libs
_find_cuda_lib("cublas", repository_ctx, cpu_value, c..., ...)
File "/<path>/tensorflow/third_party/gpus/cuda_configure.bzl", line 780, in _find_cuda_lib
find_lib(repository_ctx, [("%s/%s%s" % (bas...], ...)))
File "/<path>/tensorflow/third_party/gpus/cuda_configure.bzl", line 757, in find_lib
auto_configure_fail(("No library found under: " + ",...)))
File "/<path>/tensorflow/third_party/gpus/cuda_configure.bzl", line 348, in auto_configure_fail
fail(("\n%sCuda Configuration Error:%...)))
Cuda Configuration Error: No library found under: /usr/local/cuda-10.1/lib64/libcublas.so.10.1, /usr/local/cuda-10.1/lib64/stubs/libcublas.so.10.1, /usr/local/cuda-10.1/lib/powerpc64le-linux-gnu/libcublas.so.10.1, /usr/local/cuda-10.1/lib/x86_64-linux-gnu/libcublas.so.10.1, /usr/local/cuda-10.1/lib/x64/libcublas.so.10.1, /usr/local/cuda-10.1/lib/libcublas.so.10.1, /usr/local/cuda-10.1/libcublas.so.10.1
WARNING: Target pattern parsing failed.
ERROR: error loading package 'tensorflow/tools/pip_package': Encountered error while reading extension file 'cuda/build_defs.bzl': no such package '@local_config_cuda//cuda': Traceback (most recent call last):
File "/<path>/tensorflow/third_party/gpus/cuda_configure.bzl", line 1528
_create_local_cuda_repository(repository_ctx)
File "/<path>/tf/LinBuild/tensorflow/third_party/gpus/cuda_configure.bzl", line 1282, in _create_local_cuda_repository
_find_libs(repository_ctx, cuda_config)
File "/<path>/tensorflow/third_party/gpus/cuda_configure.bzl", line 866, in _find_libs
_find_cuda_lib("cublas", repository_ctx, cpu_value, c..., ...)
File "/<path>/tensorflow/third_party/gpus/cuda_configure.bzl", line 780, in _find_cuda_lib
find_lib(repository_ctx, [("%s/%s%s" % (bas...], ...)))
File "/<path>/tensorflow/third_party/gpus/cuda_configure.bzl", line 757, in find_lib
auto_configure_fail(("No library found under: " + ",...)))
File "/<path>/tensorflow/third_party/gpus/cuda_configure.bzl", line 348, in auto_configure_fail
fail(("\n%sCuda Configuration Error:%...)))
Cuda Configuration Error: No library found under: /usr/local/cuda-10.1/lib64/libcublas.so.10.1, /usr/local/cuda-10.1/lib64/stubs/libcublas.so.10.1, /usr/local/cuda-10.1/lib/powerpc64le-linux-gnu/libcublas.so.10.1, /usr/local/cuda-10.1/lib/x86_64-linux-gnu/libcublas.so.10.1, /usr/local/cuda-10.1/lib/x64/libcublas.so.10.1, /usr/local/cuda-10.1/lib/libcublas.so.10.1, /usr/local/cuda-10.1/libcublas.so.10.1
INFO: Elapsed time: 2.195s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
currently loading: tensorflow/tools/pip_package

So - let's start checking if I really do have libcudnn anywhere. I do the usual searches, and then start doing web searches on the errors.
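
The "usual searches" can be scripted. This is a rough sketch rather than what I actually typed: it globs for lib<name>.so* in a list of directories. The first three directories come from the error message above; the multiarch directory is an extra guess at where a package might install, and find_shared_libs is a name I made up.

```python
import glob
import os


def find_shared_libs(stem, search_dirs):
    """Return sorted paths of files named like lib<stem>.so* in search_dirs."""
    hits = []
    for directory in search_dirs:
        pattern = os.path.join(directory, "lib" + stem + ".so*")
        hits.extend(glob.glob(pattern))
    return sorted(hits)


# Directories from the error message, plus the Debian/Ubuntu multiarch
# directory (an assumption about where packages may put libraries).
SEARCH_DIRS = [
    "/usr/local/cuda-10.1/lib64",
    "/usr/local/cuda-10.1/lib64/stubs",
    "/usr/local/cuda-10.1/lib",
    "/usr/lib/x86_64-linux-gnu",
]

if __name__ == "__main__":
    for stem in ("cublas", "cudnn"):
        print(stem, find_shared_libs(stem, SEARCH_DIRS))
```

An empty list for a library means it is not in any of the listed directories, which is what the configure step was complaining about.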

Here is my repeated lesson to read all the instructions. CUDA 10.1 isn't supported yet. I have spent a bunch of time trying CUDA 10.1 builds. Nice, I should reward myself!

Back to harvesting libs - get CUDA 10.0.
(And I may have mentioned previously, I got the wrong cuDNN lib originally, so I actually have that anyway.)

Run the installs, rerun the configure, choosing the default CUDA, not 10.1.

It still doesn't find libcudnn.so.7

I highly recommend gdebi; it is a nice easy way to examine a deb package. I use it on the libcudnn7 installer to confirm where it is putting the lib, and then set the actual path rather than just assuming the default is good enough (/usr/lib/x86_64-linux-gnu/, in case you are wondering).
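
gdebi is the point-and-click route. A scripted version of the same check might look like the sketch below. It assumes dpkg-deb is installed and that its --contents listing uses the usual tar-style columns (permissions, owner, size, date, time, path); the function names are my own and the .deb path is whatever file you downloaded.

```python
import posixpath
import subprocess


def lib_dirs_from_listing(listing):
    """Collect directories that hold .so files in a dpkg-deb style listing."""
    dirs = set()
    for line in listing.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # not a file entry in the assumed six-column layout
        # Drop the symlink target, if any, and the leading "." of "./usr/...".
        path = fields[5].split(" -> ")[0].lstrip(".")
        if ".so" in posixpath.basename(path):
            dirs.add(posixpath.dirname(path))
    return sorted(dirs)


def lib_dirs_in_deb(deb_path):
    """Run dpkg-deb on a .deb file and return its shared-library directories."""
    out = subprocess.run(
        ["dpkg-deb", "--contents", deb_path],
        check=True, capture_output=True, text=True,
    ).stdout
    return lib_dirs_from_listing(out)
```

Calling lib_dirs_in_deb on the libcudnn7 package should return the directory to hand to ./configure, assuming the listing format matches the one described.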

I do that, rerun the ./configure and... it configs okay.

Let's try the magic:

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

to see what happens. It starts building. And continues building. And I write a bunch more of these (up to and including this one), plus watch some YouTube vids and check to see how chaotic UK politics is today, as potentially the world ends tomorrow...

And eventually I look across at the other monitor and see:

INFO: Elapsed time: 6091.653s, Critical Path: 132.85s
INFO: 16118 processes: 16118 local.
INFO: Build completed successfully, 20095 total actions

So - you are caught up. Let's try installing it, shall we?

./bazel-bin/tensorflow/tools/pip_package/build_pip_package --nightly_flag /tmp/tensorflow_pkg

Looks like it works. Output wheel file is in: /tmp/tensorflow_pkg.

I make a copy of the wheel, as I've noticed I lost my previous wheels. Annoyingly.

To install to conda from a wheel, you just need to run:

pip install <path to whl file>

I run that for my build, and... it reports a successful install! How exciting.

Let's run a test:

import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
2019-03-28 17:54:34.225673: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3696000000 Hz
2019-03-28 17:54:34.226140: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x559f5af2d7f0 executing computations on platform Host. Devices:
2019-03-28 17:54:34.226151: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-03-28 17:54:34.227522: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-03-28 17:54:34.228209: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_SYSTEM_DRIVER_MISMATCH: system has unsupported display driver / cuda driver combination
2019-03-28 17:54:34.228230: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:166] retrieving CUDA diagnostic information for host: Wednesday
2019-03-28 17:54:34.228234: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:173] hostname: Wednesday
2019-03-28 17:54:34.228270: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:197] libcuda reported version is: 410.104.0
2019-03-28 17:54:34.228283: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:201] kernel reported version is: 418.56.0
2019-03-28 17:54:34.228287: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:311] kernel version 418.56.0 does not match DSO version 410.104.0 -- cannot find working devices in this configuration
2019-03-28 17:54:34.228525: I tensorflow/core/common_runtime/direct_session.cc:316] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device

Sigh.
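
What the log is saying: the libcuda library that got loaded reports 410.104.0, while the kernel driver reports 418.56.0, and TensorFlow refuses to use the GPU when the two differ. A small sketch for pulling those two numbers out of a log like this one follows. The regexes are written against the two "reported version is" lines above, and the function names are made up.

```python
import re

# Patterns for the two diagnostic lines in the log above.
LIBCUDA_RE = re.compile(r"libcuda reported version is: (\d+(?:\.\d+)*)")
KERNEL_RE = re.compile(r"kernel reported version is: (\d+(?:\.\d+)*)")


def driver_versions(log_text):
    """Return (libcuda_version, kernel_version); None where a line is absent."""
    lib = LIBCUDA_RE.search(log_text)
    kernel = KERNEL_RE.search(log_text)
    return (
        lib.group(1) if lib else None,
        kernel.group(1) if kernel else None,
    )


def versions_match(log_text):
    """True only when both versions were found and are identical."""
    lib, kernel = driver_versions(log_text)
    return lib is not None and lib == kernel


log = """libcuda reported version is: 410.104.0
kernel reported version is: 418.56.0"""
print(driver_versions(log))  # ('410.104.0', '418.56.0')
print(versions_match(log))   # False
```

It assumes each log message sits on its own line; if the lines have been run together, the version pattern will swallow the start of the next timestamp.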

Playing with machine learning

29 March 2019

Getting tensorflow working - Part 3 - the GPU attempts
