I ran my model just on my CPU and it was taking ~570s per epoch.
On my home-built TensorFlow it was taking 735s (I only had two data points, and I don't think I had configured it properly).
With the GPU it was taking 1009s per epoch.
Running like that and viewing nvidia-smi I get:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 00000000:01:00.0 On | N/A |
| 8% 62C P2 44W / 151W | 1291MiB / 8116MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1322 G /usr/lib/xorg/Xorg 26MiB |
| 0 1364 G /usr/bin/gnome-shell 51MiB |
| 0 1636 G /usr/lib/xorg/Xorg 413MiB |
| 0 1755 G /usr/bin/gnome-shell 224MiB |
| 0 3112 G ...quest-channel-token=4965395672649938349 160MiB |
| 0 19665 G ...than/anaconda3/envs/p36TFGJT/bin/python 3MiB |
| 0 19782 C ...than/anaconda3/envs/p36TFGJT/bin/python 399MiB |
+-----------------------------------------------------------------------------+
htop gives me:
  1 [||||                         3.3%]    4 [|||||                        4.6%]
  2 [|||||||||||||||||||||||||   66.4%]    5 [||||                         3.3%]
  3 [||||||||||||||              38.5%]    6 [||||                         3.3%]
  Mem[|||||||||||||        5.82G/31.3G]    Tasks: 207, 1066 thr; 2 running
  Swp[                        0K/2.00G]    Load average: 1.00 2.10 3.08   Uptime: 19:21:32

    PID USER  PRI NI  VIRT   RES   SHR  S CPU%  MEM%  TIME+    Command
  19782 me     20  0 18.8G  1785M  519M S 105.   5.6  0:48.31  /anaconda3/envs/p36TFGJT/bin/python -m spyder_kernels.console -f /run/user/1000/jupyter/kernel-8283c73bc975.json
  19922 me     20  0 18.8G  1785M  519M R 99.0   5.6  0:40.12  /anaconda3/envs/p36TFGJT/bin/python -m spyder_kernels.console -f /run/user/1000/jupyter/kernel-8283c73bc975.json
   1755 me     20  0 4254M   435M 96436 S  3.9   1.4  5:13.84  /usr/bin/gnome-shell
  19665 me     20  0 3068M   233M  107M S  2.0   0.7  0:04.11  /anaconda3/envs/p36TFGJT/bin/python /home/jonathan/anaconda3/envs/p36TFGJT/bin/spyder
   1636 root   20  0  619M   170M 85348 S  1.3   0.5  3:09.35  /usr/lib/xorg/Xorg vt2 -displayfd 3 -auth /run/user/1000/gdm/Xauthority -background none -noreset -keeptty -verbose 3
   2293 me     20  0  797M  46732 28624 S  1.3   0.1  0:05.04  /usr/lib/gnome-terminal/gnome-terminal-server
  19893 me     20  0 18.8G  1785M  519M S  1.3   5.6  0:00.31  /anaconda3/envs/p36TFGJT/bin/python -m spyder_kernels.console -f /run/user/1000/jupyter/kernel-8283c73bc975.json
  19898 me     20  0 18.8G  1785M  519M S  1.3   5.6  0:00.30  /anaconda3/envs/p36TFGJT/bin/python -m spyder_kernels.console -f /run/user/1000/jupyter/kernel-8283c73bc975.json
   3072 me     20  0 1526M   324M  120M S  0.7   1.0 10:02.91  /opt/google/chrome/chrome
  18139 me     20  0 41792   5956  3880 R  0.7   0.0  0:16.52  htop
   3112 me     20  0  786M   270M 99900 S  0.7   0.8  4:48.11  /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=4918414677377325872,16218879521054795833,131072 --gpu-preference
  19895 me     20  0 18.8G  1785M  519M S  0.7   5.6  0:00.29  /anaconda3/envs/p36TFGJT/bin/python -m spyder_kernels.console -f /run/user/1000/jupyter/kernel-8283c73bc975.json
  19896 me     20  0 18.8G  1785M  519M S  0.7   5.6  0:00.29  /anaconda3/envs/p36TFGJT/bin/python -m spyder_kernels.console -f /run/user/1000/jupyter/kernel-8283c73bc975.json
  19820 me     20  0 18.8G  1785M  519M S  0.7   5.6  0:00.22  /anaconda3/envs/p36TFGJT/bin/python -m spyder_kernels.console -f /run/user/1000/jupyter/kernel-8283c73bc975.json
  19897 me     20  0 18.8G  1785M  519M S  0.7   5.6  0:00.83  /anaconda3/envs/p36TFGJT/bin/python -m spyder_kernels.console -f /run/user/1000/jupyter/kernel-8283c73bc975.json
   1993 me     20  0  789M  38508 29340 S  0.7   0.1  3:39.50  psensor
  19894 me     20  0 18.8G  1785M  519M S  0.7   5.6  0:00.31  /anaconda3/envs/p36TFGJT/bin/python -m spyder_kernels.console -f /run/user/1000/jupyter/kernel-8283c73bc975.json
   3578 me     20  0  816M   163M 71896 S  0.7   0.5  2:19.69  /opt/google/chrome/chrome --type=renderer --field-trial-handle=4918414677377325872,16218879521054795833,131072 --service-pipe-toke
   1642 root   20  0  619M   170M 85348 S  0.7   0.5  0:04.79  /usr/lib/xorg/Xorg vt2 -displayfd 3 -auth /run/user/1000/gdm/Xauthority -background none -noreset -keeptty -verbose 3
  19892 me     20  0 18.8G  1785M  519M S  0.7   5.6  0:00.07  /anaconda3/envs/p36TFGJT/bin/python -m spyder_kernels.console -f /run/user/1000/jupyter/kernel-8283c73bc975.json
  19891 me     20  0 18.8G  1785M  519M S  0.0   5.6  0:00.29  /anaconda3/envs/p36TFGJT/bin/python -m spyder_kernels.console -f /run/user/1000/jupyter/kernel-8283c73bc975.json
   3240 me     20  0  763M   104M 64516 S  0.0   0.3  1:32.81  /opt/google/chrome/chrome --type=renderer --field-trial-handle=4918414677377325872,16218879521054795833,131072 --service-pipe-toke
  19785 me     20  0 3068M   233M  107M S  0.0   0.7  0:00.17  /anaconda3/envs/p36TFGJT/bin/python /home/jonathan/anaconda3/envs/p36TFGJT/bin/spyder
   3093 me     20  0 1526M   324M  120M S  0.0   1.0  3:25.67  /opt/google/chrome/chrome
I don't seem to be stressing my system particularly. I want to stress it; it can work harder, dammit.
I need to read this:
https://stackoverflow.com/questions/41948406/why-is-my-gpu-slower-than-cpu-when-training-lstm-rnn-models
But this one: https://medium.com/@joelognn/improving-cnn-training-times-in-keras-7405baa50e09
has a quick fix (I like quick fixes!).
In my .fit_generator call, add:
use_multiprocessing=True, workers=6,
So let's try that. (The number of workers recommended is actually higher, but that was on a Ryzen 7 with 16 threads available; I have an i5 with six threads.)
Let's try this.
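Concretely, the call ends up looking roughly like this (`model` and `training_generator` are placeholders for my actual model and generator; the other arguments stay whatever they already were):

# Sketch only: `model` and `training_generator` stand in for my real objects.
model.fit_generator(
    training_generator,
    steps_per_epoch=steps_per_epoch,   # unchanged
    epochs=epochs,                     # unchanged
    use_multiprocessing=True,          # feed batches from worker processes
    workers=6)                         # roughly one per available CPU thread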
nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 00000000:01:00.0 On | N/A |
| 8% 62C P2 47W / 151W | 1303MiB / 8116MiB | 11% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1322 G /usr/lib/xorg/Xorg 26MiB |
| 0 1364 G /usr/bin/gnome-shell 51MiB |
| 0 1636 G /usr/lib/xorg/Xorg 413MiB |
| 0 1755 G /usr/bin/gnome-shell 224MiB |
| 0 3112 G ...quest-channel-token=4965395672649938349 172MiB |
| 0 20260 G ...than/anaconda3/envs/p36TFGJT/bin/python 3MiB |
| 0 20376 C ...than/anaconda3/envs/p36TFGJT/bin/python 399MiB |
+-----------------------------------------------------------------------------+
It definitely seems to be using more of the GPU.
htop was hard to get a view of, because it was hammering all cores :) That's what I like to see.
Epoch time was 198s, down from 1009s.
However, Keras does give me a warning:
UserWarning: Using a generator with `use_multiprocessing=True` and multiple workers may duplicate your data. Please consider using the`keras.utils.Sequence class.
UserWarning('Using a generator with `use_multiprocessing=True`'
Hmm, is this really faster, or is it just not doing as good a job?
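For reference, the duplicate-data risk is because a plain generator is simply pulled from by each worker, so two workers can end up yielding the same batch. The keras.utils.Sequence class the warning points at is indexed instead, so each worker fetches a distinct batch. A minimal sketch of what that could look like (the class name, the x/y arrays and the batch size are placeholders, not my actual pipeline):

import numpy as np
from keras.utils import Sequence

class BatchSequence(Sequence):
    # Placeholder name; wraps in-memory arrays x and y into indexed batches.
    def __init__(self, x, y, batch_size=32):
        self.x, self.y = x, y
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        # Each worker asks for a specific batch index, so no batch is
        # produced twice the way it can be with a plain generator.
        lo = idx * self.batch_size
        return self.x[lo:lo + self.batch_size], self.y[lo:lo + self.batch_size]

# Then pass BatchSequence(x_train, y_train) to fit_generator with the same
# use_multiprocessing=True, workers=6 arguments as above.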
Let's try a real run with a lot more epochs and see what results I get. 1¾ hours for 25 epochs. Not too bad.
I really do need to start tweaking my model now though. Something for later, I guess.