Semi-OT: Hardware: NVidia Proprietary Driver, C7.4

September 26, 2017 CentOS 21 Comments

This is really frustrating. I’ve got a server with two K20c Tesla cards. I need to use the proprietary drivers to use the CUDA toolkit. Btw, I had no trouble at all building for CentOS 7.3.

I have what NVidia claims is the correct driver package, a 340 series. It appears to build, but then fails to load. The only error I see is “no such device”, which makes no sense to me, esp. since it says nothing whatever else.

I’ve gone through the install log, and there are a bunch of Note: lines and warnings, but the latter, I think, are all about comparing signed and unsigned integers.

And lsmod shows no nvidia drivers registered, but the log claims: Error: Driver ‘nvidia’ is already registered, aborting…

Anyone got any ideas?

mark

21 thoughts on - Semi-OT: Hardware: NVidia Proprietary Driver, C7.4

Scott Robbins says:

September 26, 2017 at 12:59 pm

Why not use the elrepo repo? They’ve worked flawlessly for me, both with legacy and new cards.
Phelps, Matt says:

September 26, 2017 at 1:19 pm

Seconded. We use the elrepo repository for hundreds of workstations and have had no issues. Takes care of everything automatically.
Phil Perry says:

September 26, 2017 at 2:45 pm

You don’t say which version of the 340 series driver you have tried.

There was a bug with recent legacy releases that affected el7.4 kernels. We (elrepo) patched the driver to fix that on rhel7.4 releases. I’m not sure, but it _may_ have been fixed in the 340.104 driver released last week. I’ve not bothered building it, as the changelog only mentions “Improved compatibility with recent Linux kernels”, which we patched/fixed in our previous release, and other issues which don’t affect kmods on RHEL.

So it sounds like a known issue which has already been fixed. If you don’t want to use our packages, maybe take a look at the patch and try applying it to your build.
Pete Biggs says:

September 26, 2017 at 3:29 pm

Yes, but these are Tesla cards with the CUDA toolkit – I’ve never got the elrepo versions to work properly when developing CUDA applications.

P.
Pete Biggs says:

September 26, 2017 at 3:32 pm

Have you tried installing the toolkit from nVidia’s own repository:

https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro
vychytraly . says:

September 26, 2017 at 4:47 pm

From my experience elrepo nvidia drivers work fine with CUDA packages from nvidia repository
Niki Kovacs says:

September 26, 2017 at 4:47 pm

On 26/09/2017 at 19:59, Scott Robbins wrote:

I know this is weird, but I’ve had cases where the downloaded NVidia driver worked and the ELRepo driver didn’t, and the other way around.

Details here: https://blog.microlinux.fr/nvidia-CentOS/

Niki

—
Microlinux – Solutions informatiques durables
7, place de l’église – 30730 Montpezat Web : http://www.microlinux.fr Mail : info@microlinux.fr Tél. : 04 66 63 10 32
Sorin Srbu says:

September 27, 2017 at 1:54 am

After upgrading from 7.3 to 7.4 the GUI won’t start anymore. Using an Nvidia GTX260 with the elrepo drivers. Am investigating, but so far zip…

Is the article linked above available in English?

—
//Sorin
Sorin Srbu says:

September 27, 2017 at 1:56 am

Tested 340.76, 340.102, 340.104 (elrepo and proprietary). No luck over here with a GTX260 and the 64-bit drivers.

Will test some more, if still no luck, I’ll just reinstall from scratch.
says:

September 27, 2017 at 10:54 am

Hi, folks,

Well, still more fun (for values of fun approaching zero):

1. Went to install CUDA 9.0… well, gee, there is *no* CUDA 9.0. Even though I installed the 9 repo, all that I get is 8. I’ve used their webform, and am waiting on a reply.
2. I removed all nvidia packages.
3. It appears that kmod-nvidia is what I need; that’s what nvidia-detect says. So I try to install… bzzt, thank you for playing.

a: uname -a: 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
b:
Installing : kmod-nvidia-384.90-1.el7_4.elrepo.x86_64    1/2

Broadcast message from systemd-journald@lyon.cit.nih.gov (Wed 2017-09-27 11:43:12 EDT):

dracut[32409]: /lib/modules/3.10.0-693.el7.x86_64//modules.dep is missing. Did you run depmod?

Message from syslogd@lyon at Sep 27 11:43:12 … dracut:/lib/modules/3.10.0-693.el7.x86_64//modules.dep is missing. Did you run depmod?

Message from syslogd@lyon at Sep 27 11:43:12 … dracut: /lib/modules/3.10.0-693.el7.x86_64//modules.dep is missing. Did you run depmod?
Working. This may take some time …
/lib/modules/3.10.0-693.el7.x86_64//modules.dep is missing. Did you run depmod?
/sbin/weak-modules: line 116: /boot/initramfs-3.10.0-693.el7.x86_64.tmp: No such file or directory
/sbin/weak-modules: line 132: /boot/initramfs-3.10.0-693.el7.x86_64.tmp: No such file or directory
/sbin/weak-modules: line 137: /boot/initramfs-3.10.0-693.el7.x86_64.tmp: No such file or directory
Unable to decompress /boot/initramfs-3.10.0-693.el7.x86_64.tmp: Unknown format
/sbin/weak-modules: line 175: /tmp/weak-modules.oC1A7x/new_initramfs.img: No such file or directory
rm: cannot remove ‘/tmp/weak-modules.oC1A7x/new_initramfs.img’: No such file or directory
mv: cannot stat ‘/boot/initramfs-3.10.0-693.el7.x86_64.tmp’: No such file or directory
Done.
Installing : nvidia-x11-drv-384.90-1.el7.elrepo.x86_64    2/2
etckeeper: post transaction commit
Verifying : kmod-nvidia-384.90-1.el7_4.elrepo.x86_64    1/2
Verifying : nvidia-x11-drv-384.90-1.el7.elrepo.x86_64    2/2

Installed:
kmod-nvidia.x86_64 0:384.90-1.el7_4.elrepo

Dependency Installed:
nvidia-x11-drv.x86_64 0:384.90-1.el7.elrepo

Complete!

Well, no, it’s not complete, and it’s trying to install in the *previous* kernel, not the running one.

mark
hw says:

September 27, 2017 at 11:22 am

m.roth@5-cent.us wrote:

If your intention is to use current NVIDIA drivers, you could try the download from their website. I’ve had good success with installing them directly from the download NVIDIA provides.

I know we aren’t supposed to do that, but after using that for years and then using distribution-provided NVIDIA drivers, I went back to the NVIDIA package because that was far more trouble-free and continues to be so. When you get a new kernel and when some libraries are updated, you need to reinstall the NVIDIA drivers, but I can live with that.
Phil Perry says:

September 27, 2017 at 1:47 pm

The kmod-nvidia-340xx-340.102-4.el7_4.elrepo.x86_64.rpm driver should work for your card on el7.4.

All previous releases in elrepo were for el7.3 (and earlier) and are not compatible with the el7.4 series kernel.
Phil Perry says:

September 27, 2017 at 2:03 pm

kmod packages are a special class of package on RHEL that take advantage of the stable kernel ABI in Red Hat Enterprise Linux. When a kmod package is compiled against a kernel, the kernel module will be installed for that kernel and the weak-modules script will then weak link the module against all other kABI-compatible kernels installed on the system. This means that you do not need to rebuild the kernel module for each and every kernel update (or worse, delay updating your kernel whilst you wait for me to rebuild the module for you).

So yes, the module will likely be installed against a previous kernel, and maybe one that isn’t even installed on your system. But it will weak link against your current kernel(s), provided none of the kernel symbols used by the module have changed between the kernel the module was built against and the current kernel in question. If you don’t understand, just think of it as magic and be grateful you are running an Enterprise Linux kernel and not a Fedora kernel.
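
As a rough illustration of the weak-linking rule described above (a minimal sketch, not the actual /sbin/weak-modules logic): a module can be reused on another kernel when every kernel symbol it imports has the same checksum in that kernel. The function name, the symbol names and the checksum values below are made up for the example.

```python
def is_kabi_compatible(module_symbols, kernel_symbols):
    """Compare the symbols a module imports against those a kernel exports.

    Both arguments map symbol name -> checksum. Returns (ok, mismatches),
    where mismatches lists symbols that are missing or whose checksum differs.
    """
    mismatches = sorted(
        name
        for name, crc in module_symbols.items()
        if kernel_symbols.get(name) != crc
    )
    return (not mismatches, mismatches)


# Hypothetical checksums, for illustration only.
module_needs = {"__kmalloc": 0x1111, "pci_enable_device": 0x2222}
built_against = {"__kmalloc": 0x1111, "pci_enable_device": 0x2222, "printk": 0x3333}
changed_kernel = {"__kmalloc": 0x1111, "pci_enable_device": 0x9999, "printk": 0x3333}

print(is_kabi_compatible(module_needs, built_against))
print(is_kabi_compatible(module_needs, changed_kernel))
```

In this toy model the second kernel changed one imported symbol, so the module would need a rebuild for it, while unrelated symbols (printk here) play no part in the decision.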

As to the earlier error messages, have you been playing with depmod? Where is your modules.dep for your installed kernels? Anyway, the magic described above has likely not worked correctly due to missing modules.dep, so I would uninstall the nvidia packages, sort out your kernel(s) / depmod information and try again once you have a sane system.
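
One way to answer the modules.dep question is to walk /lib/modules and report every kernel directory that lacks the file. This is only a sketch; the function name is hypothetical, and it assumes the standard /lib/modules/<kernel-version>/modules.dep layout seen in the dracut messages earlier in the thread.

```python
import os


def kernels_missing_modules_dep(modules_root="/lib/modules"):
    """Return kernel directories under modules_root that have no modules.dep."""
    missing = []
    if not os.path.isdir(modules_root):
        return missing
    for kernel in sorted(os.listdir(modules_root)):
        kernel_dir = os.path.join(modules_root, kernel)
        dep_file = os.path.join(kernel_dir, "modules.dep")
        if os.path.isdir(kernel_dir) and not os.path.isfile(dep_file):
            missing.append(kernel)
    return missing


if __name__ == "__main__":
    for kernel in kernels_missing_modules_dep():
        # depmod accepts a kernel version as its argument.
        print(f"{kernel}: modules.dep is missing; try: depmod -a {kernel}")
```

Running it as root is not required, since it only reads directory listings.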
says:

September 27, 2017 at 2:30 pm

Phil Perry wrote:

Ok. I had thought it did. Odd. The original kernel is installed, so I don’t know why modules.dep wasn’t there. I haven’t had to run depmod before.

Btw, about your previous email: nvidia-detect tells me to use kmod-nvidia for the K20c. When I go to the elrepo page about it, and follow the link, for the 340, I don’t see it supporting them, but the non-legacy does.

mark
says:

September 27, 2017 at 3:53 pm

Ok… I’ve cleaned up, run depmod on the previous/original kernel, and reinstalled kmod-nvidia. Both depmod and the install complained that they couldn’t find modules.order and one other file, but the install seemed to go fine.

Now, I see that kmod-nvidia includes the nvidia-uvm-kmod, as well as CUDA libraries. How do I test to see if it can see the Tesla cards? It used to be that I’d install CUDA, build the samples, and run enum_gpu. When I rebuilt the other server, with a pair of M2090s, I could build the proprietary install, install CUDA, then build the samples and run bin/deviceQueryDrv.

Is there something I can run to check that it sees the cards? I haven’t found anything yet.

mark
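
One possible check that does not need the CUDA samples is to ask nvidia-smi for its GPU list. The sketch below assumes nvidia-smi was installed alongside the driver and that `nvidia-smi -L` prints one line per card in the form `GPU 0: <name> (UUID: <uuid>)`; the function names are hypothetical.

```python
import re
import shutil
import subprocess

# Assumed line format of `nvidia-smi -L`: "GPU 0: Tesla K20c (UUID: GPU-...)"
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (.+)\)$")


def parse_gpu_list(text):
    """Turn `nvidia-smi -L` style output into (index, name, uuid) tuples."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            index, name, uuid = match.groups()
            gpus.append((int(index), name, uuid))
    return gpus


def list_gpus():
    """Run nvidia-smi if it is on PATH; return [] if it is absent or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    gpus = list_gpus()
    if not gpus:
        print("No GPUs reported (driver not loaded, or nvidia-smi not installed).")
    for index, name, uuid in gpus:
        print(f"GPU {index}: {name} [{uuid}]")
```

An empty result only says that nvidia-smi could not talk to the driver; it does not distinguish between a module that failed to load and a card the driver does not support.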
Phil Perry says:

September 27, 2017 at 3:59 pm

I would trust what nvidia-detect tells you. It is based on the definitive information provided by NVIDIA in their docs:

http://us.download.nvidia.com/XFree86/Linux-x86_64/384.90/README/supportedchips.html
Sorin Srbu says:

September 28, 2017 at 2:50 am

My troubleshooting yesterday, just before I went home from work, showed that it seemed to have been gdm that borked out for some reason. I’ve never had that happen to me, regardless of CentOS version. Installing lightdm brought everything back up as expected.

Has anybody else had gdm act up?

Weird in any case.
Kretschmer, Jens says:

November 20, 2017 at 5:23 am

Hi Mark,

Did you manage to sort out the messages from dracut and /sbin/weak-modules that you received while installing kmod-nvidia? We get the same messages while installing kmod-nvidia-384.98-1.el7_4.elrepo.x86_64 on RHEL 7.4 with the kernel 3.10.0-693.5.2.el7.x86_64.

Kr, Jens
Johnny Hughes says:

November 20, 2017 at 7:13 am

Weak modules allow modules built on one kernel version to apply to another kernel as long as the ‘build dependencies’ are the same. There are many individual drivers, etc. in a given kernel and not all changes involve all areas. If the things that were used to build a given module did not change, it can still be run. If some dependency did change, the module needs to be rebuilt. The exact error wording is important: it could be a warning or it could be an error requiring a module rebuild.

You might get a warning if you do not have the kernel version that the original module was built on, but it might be (most likely is) OK to run anyway so long as no deps are changed.

Anyway, the elrepo guys and gals should know if you need a new module or not for a specific kernel version.
Valeri Galtsev says:

November 20, 2017 at 8:24 am

My comment here may seem to be completely off topic. However:

It is not the nvidia driver that you folks build/compile. The nvidia driver comes from nvidia as a precompiled binary, and none of the nvidia drivers was ever disclosed to anybody outside their company. Even the specifications of their hardware are not fully disclosed, therefore the programmers who wrote open source drivers for nvidia hardware could not write a better driver than they did. Not their fault: they just don’t have all the necessary information.

Repeating the wrong words “compiling nvidia drivers” perpetuates a misperception that has been widespread for a long time. It is not the driver you are compiling, but merely the interface between the binary nvidia driver and a particular kernel.

I probably should add rant tags as I am ranting about the fact that some company seems friendlier to open source than it actually is…

Valeri

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev Sr System Administrator Department of Astronomy and Astrophysics Kavli Institute for Cosmological Physics University of Chicago Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++
Kretschmer, Jens says:

November 22, 2017 at 3:28 am

Thank you!

I messaged the ELRepo mailing list. We are currently investigating the cause of the messages.

Br, Jens
