Semi-OT: Hardware: NVidia Proprietary Driver, C7.4


This is really frustrating. I’ve got a server with two K20c Tesla cards. I need to use the proprietary drivers to use the CUDA toolkit. Btw, I had no trouble at all with building for CentOS 7.3.

I have what NVidia claims is the correct driver package, a 340 series. It appears to build, but then fails to load. The only error I see is “no such device”, which makes no sense to me, esp. since it says nothing whatever else.

I’ve gone through the install log, and there are a bunch of Note: lines and warnings, but the latter, I think, are all about comparing signed and unsigned integers.

And lsmod shows no nvidia drivers registered, but the log claims: Error: Driver ‘nvidia’ is already registered, aborting…

Anyone got any ideas?

mark

21 thoughts on - Semi-OT: Hardware: NVidia Proprietary Driver, C7.4

  • Seconded. We use the elrepo repository for hundreds of workstations and have had no issues. Takes care of everything automatically.

  • You don’t say which version of the 340 series driver you have tried.

    There was a bug with recent legacy releases that affected el7.4 kernels. We (elrepo) patched the driver to fix that on rhel7.4 releases. I’m not sure but it _may_ have been fixed in the 340.104 driver released last week – I’ve not bothered building it as the changelog only mentions “Improved compatibility with recent Linux kernels”, which we patched/fixed in our previous release, and other issues which don’t affect kmods on RHEL.

    So it sounds like a known issue which has already been fixed. If you don’t want to use our packages, maybe take a look at the patch and try applying it to your build.

  • Yes, but these are Tesla cards with the CUDA toolkit – I’ve never got the elrepo versions to work properly when developing CUDA applications.

    P.

  • After upgrading from 7.3 to 7.4 the GUI won’t start anymore. Using an Nvidia GTX260 with the elrepo drivers. Am investigating, but so far zip…

    Is the article provided above available in English?


    //Sorin

  • Tested 340.76, 340.102, 340.104 (elrepo and proprietary). No luck over here with a GTX260 and the 64-bit drivers.

    Will test some more, if still no luck, I’ll just reinstall from scratch.

  • Hi, folks,

    Well, still more fun (for values of fun approaching zero):

    1. Went to install CUDA 9.0… well, gee, there is *no* CUDA 9.0. Even though I installed the 9 repo, all that I get is 8. I’ve used their webform, and am waiting on a reply.
    2. I remove all nvidia packages.
    3. It appears that kmod-nvidia is what I need; that’s what nvidia-detect says. So I try to install… bzzt, thank you for playing.

    a: uname -a: 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
    b:
    Installing : kmod-nvidia-384.90-1.el7_4.elrepo.x86_64    1/2

    Broadcast message from systemd-journald@lyon.cit.nih.gov (Wed 2017-09-27
    11:43:12 EDT):

    dracut[32409]: /lib/modules/3.10.0-693.el7.x86_64//modules.dep is missing. Did you run depmod?

    Message from syslogd@lyon at Sep 27 11:43:12 … dracut:/lib/modules/3.10.0-693.el7.x86_64//modules.dep is missing. Did you run depmod?

    Message from syslogd@lyon at Sep 27 11:43:12 … dracut: /lib/modules/3.10.0-693.el7.x86_64//modules.dep is missing. Did you run depmod?
    Working. This may take some time …
    /lib/modules/3.10.0-693.el7.x86_64//modules.dep is missing. Did you run depmod?
    /sbin/weak-modules: line 116: /boot/initramfs-3.10.0-693.el7.x86_64.tmp: No such file or directory
    /sbin/weak-modules: line 132: /boot/initramfs-3.10.0-693.el7.x86_64.tmp: No such file or directory
    /sbin/weak-modules: line 137: /boot/initramfs-3.10.0-693.el7.x86_64.tmp: No such file or directory
    Unable to decompress /boot/initramfs-3.10.0-693.el7.x86_64.tmp: Unknown format
    /sbin/weak-modules: line 175: /tmp/weak-modules.oC1A7x/new_initramfs.img: No such file or directory
    rm: cannot remove ‘/tmp/weak-modules.oC1A7x/new_initramfs.img’: No such file or directory
    mv: cannot stat ‘/boot/initramfs-3.10.0-693.el7.x86_64.tmp’: No such file or directory
    Done.
    Installing : nvidia-x11-drv-384.90-1.el7.elrepo.x86_64    2/2
    etckeeper: post transaction commit
    Verifying : kmod-nvidia-384.90-1.el7_4.elrepo.x86_64    1/2
    Verifying : nvidia-x11-drv-384.90-1.el7.elrepo.x86_64    2/2

    Installed:
    kmod-nvidia.x86_64 0:384.90-1.el7_4.elrepo

    Dependency Installed:
    nvidia-x11-drv.x86_64 0:384.90-1.el7.elrepo

    Complete!

    Well, no it’s not complete, and it’s trying to install in the *previous*
    kernel, not the running one.

    mark

  • m.roth@5-cent.us wrote:

    If your intention is to use current NVIDIA drivers, you could try the download from their website. I’ve had good success with installing them directly from the download NVIDIA provides.

    I know we aren’t supposed to do that, but after using that for years and then using distribution-provided NVIDIA drivers, I went back to the NVIDIA package because that was far more trouble-free and continues to be so. When you get a new kernel and when some libraries are updated, you need to reinstall the NVIDIA drivers, but I can live with that.

  • The kmod-nvidia-340xx-340.102-4.el7_4.elrepo.x86_64.rpm driver should work for your card on el7.4.

    All previous releases in elrepo were for el7.3 (and earlier) and are not compatible with the el7.4 series kernel.

  • kmod packages are a special class of package on RHEL that take advantage of the stable kernel ABI in Red Hat Enterprise Linux. When a kmod package is compiled against a kernel, the kernel module will be installed for that kernel and the weak-modules script will then weak link the module against all other kABI-compatible kernels installed on the system. This means that you do not need to rebuild the kernel module for each and every kernel update (or worse, delay updating your kernel whilst you wait for me to rebuild the module for you).

    So yes, the module will likely be installed against a previous kernel, and maybe one that isn’t even installed on your system. But it will weak link against your current kernel(s), provided none of the kernel symbols used by the module have changed between the kernel the module was built against and the current kernel in question. If you don’t understand, just think of it as magic and be grateful you are running an Enterprise Linux kernel and not a Fedora kernel.

    As to the earlier error messages, have you been playing with depmod?
    Where is your modules.dep for your installed kernels? Anyway, the magic described above has likely not worked correctly due to missing modules.dep, so I would uninstall the nvidia packages, sort out your kernel(s) / depmod information and try again once you have a sane system.
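
A minimal sketch of the "where is your modules.dep" check, assuming the usual layout of one directory per kernel under /lib/modules. The function name `kernels_missing_modules_dep` is made up for illustration; it only reports which kernel directories lack a modules.dep file and does not run depmod itself.

```python
from pathlib import Path


def kernels_missing_modules_dep(modules_root="/lib/modules"):
    """Return the kernel directories under modules_root that have no modules.dep."""
    root = Path(modules_root)
    if not root.is_dir():
        # Nothing to inspect (e.g. not a Linux system, or an unusual layout).
        return []
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and not (entry / "modules.dep").is_file()
    )


if __name__ == "__main__":
    for version in kernels_missing_modules_dep():
        # depmod accepts a kernel version as its argument.
        print(f"{version}: modules.dep is missing (try: depmod -a {version})")
```

A directory listed here may simply be a leftover from a kernel that is no longer installed, so the output is a starting point for cleanup rather than a verdict.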

  • Phil Perry wrote:

    Ok. I had thought it did. Odd. The original kernel is installed, so I don’t know why modules.dep wasn’t there. I haven’t had to run depmod before.

    Btw, about your previous email: nvidia-detect tells me to use kmod-nvidia for the K20c. When I go to the elrepo page about it, and follow the link, for the 340, I don’t see it supporting them, but the non-legacy does.

    mark

  • Ok… I’ve cleaned up, run a depmod on the previous/original kernel, and reinstalled kmod-nvidia. Neither the depmod nor the install found a modules.order and one other file, but both seemed to complete fine.

    Now, I see that kmod-nvidia includes the nvidia-uvm-kmod, as well as cuda libraries. How do I test to see if it can see the Tesla cards? It used to be that I’d install cuda, build the samples, and run enum_gpu. When I rebuilt the other server, with a pair of M2090s, I could build the proprietary install, and install cuda, and then build the samples, and run bin/deviceQueryDrv.

    Is there something I can run that I can see that it sees the cards? I haven’t found anything yet.

    mark
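
One low-level way to approach the question above, sketched under assumptions: filter the output of `lspci -nn` for NVIDIA's PCI vendor ID (10de). The helper names are made up for illustration. This only shows that the cards are visible on the PCI bus; it says nothing about whether the kernel module has bound to them, which the driver's own nvidia-smi utility (if installed) would report.

```python
import re
import subprocess

NVIDIA_VENDOR_ID = "10de"  # PCI vendor ID assigned to NVIDIA


def nvidia_devices(lspci_text):
    """Return the lines of `lspci -nn` output that carry NVIDIA's vendor ID."""
    pattern = re.compile(r"\[%s:[0-9a-f]{4}\]" % NVIDIA_VENDOR_ID, re.IGNORECASE)
    return [line.strip() for line in lspci_text.splitlines() if pattern.search(line)]


def run_lspci():
    """Run `lspci -nn` and return its standard output as text."""
    result = subprocess.run(
        ["lspci", "-nn"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    try:
        found = nvidia_devices(run_lspci())
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("lspci is not available or failed; is pciutils installed?")
    else:
        for line in found:
            print(line)
        if not found:
            print("No NVIDIA devices visible on the PCI bus.")
```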

  • My trouble-shooting yesterday, just before I went home from work, showed that it seemed to have been gdm that borked out for some reason. I’ve never had that happen to me, regardless of CentOS version. Installing lightdm brought everything back up as expected.

    Has anybody else had gdm act up?

    Weird in any case.

  • Hi Mark,

    Did you manage to sort out the messages from dracut and /sbin/weak-modules that you received while installing kmod-nvidia? We get the same messages while installing kmod-nvidia-384.98-1.el7_4.elrepo.x86_64 on RHEL 7.4 with the kernel 3.10.0-693.5.2.el7.x86_64.

    Kr, Jens

  • Weak modules allow modules built on one kernel version to apply to another kernel as long as the ‘build dependencies’ are the same. There are many individual drivers, etc. in a given kernel and not all changes involve all areas. If the things that were used to build a given module did not change, it can still be run. If some dependency did change, the module needs to be rebuilt. The exact error wording is important: it could be a warning or it could be an error requiring a module rebuild.

    You might get a warning if you do not have the kernel version that the original module was built on, but it might be (most likely is) OK to run anyway so long as no deps are changed.

    Anyway, the elrepo guys and gals should know if you need a new module or not for a specific kernel version.
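
The compatibility rule described above can be sketched as a comparison of symbol checksums. This is an illustration of the idea only, not the actual /sbin/weak-modules script; the function names, symbol names, and checksum values are all made up.

```python
def changed_symbols(module_needs, kernel_exports):
    """Return the symbols a module needs that the kernel lacks or exports
    with a different checksum.

    Both arguments map a symbol name to its checksum string.
    """
    return sorted(
        symbol
        for symbol, checksum in module_needs.items()
        if kernel_exports.get(symbol) != checksum
    )


def can_weak_link(module_needs, kernel_exports):
    """A module can be reused on a kernel only if none of its symbols changed."""
    return not changed_symbols(module_needs, kernel_exports)


if __name__ == "__main__":
    # Hypothetical data: the module uses two symbols from the kernel it was built on.
    module_needs = {"example_alloc": "0x1111", "example_lock": "0x2222"}
    same_abi_kernel = {"example_alloc": "0x1111", "example_lock": "0x2222", "other": "0x3333"}
    changed_kernel = {"example_alloc": "0x1111", "example_lock": "0x9999"}

    print(can_weak_link(module_needs, same_abi_kernel))
    print(changed_symbols(module_needs, changed_kernel))
```

The kernel exporting extra symbols does not matter here; only the symbols the module actually uses have to match, which is why many kernel updates leave an existing kmod usable.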

  • My comment here may seem to be completely off topic. However:

    It is not the nvidia driver that you folks build/compile. The nvidia driver comes from nvidia as a precompiled binary, and none of the nvidia drivers was ever disclosed to anybody outside their company. Even the specifications of their hardware are not fully disclosed, therefore the programmers who wrote open source drivers for nvidia hardware could not write a better driver than they wrote. Not their fault: they just don’t have all the necessary information.

    Repeating the wrong words “compiling nvidia drivers” creates a misperception, which has been widespread for a long time. It is not the driver you are compiling, but merely the interface between the binary nvidia driver and a particular kernel.

    I probably should add rant tags as I am ranting about the fact that some company seems friendlier to open source than it actually is…

    Valeri

    ++++++++++++++++++++++++++++++++++++++++
    Valeri Galtsev
    Sr System Administrator
    Department of Astronomy and Astrophysics
    Kavli Institute for Cosmological Physics
    University of Chicago
    Phone: 773-702-4247
    ++++++++++++++++++++++++++++++++++++++++

  • Thank you!

    I messaged the ELREPO Mailing List. We are currently investigating the cause of the messages.

    Br, Jens