## Common Issues

### Backends Not Up

The backends need to be up for search to run. When the backends are up, running `npusearch_check` prints output similar to this:
```bash
lrl_admin@guava:~$ npusearch_check
[
  "npusearch:request:guava-0-1710520177",
  "npusearch:request:guava-1-1710520177",
  "npusearch:request:guava-10-1710520177",
  "npusearch:request:guava-11-1710520177",
  "npusearch:request:guava-12-1710520177",
  "npusearch:request:guava-13-1710520177",
  "npusearch:request:guava-14-1710520177",
  "npusearch:request:guava-15-1710520177",
  "npusearch:request:guava-2-1710520177",
  "npusearch:request:guava-3-1710520177",
  "npusearch:request:guava-4-1710520177",
  "npusearch:request:guava-5-1710520177",
  "npusearch:request:guava-6-1710520177",
  "npusearch:request:guava-7-1710520177",
  "npusearch:request:guava-8-1710520177",
  "npusearch:request:guava-9-1710520177"
]
```

When the backends are down, the list is empty:
```bash
lrl_admin@guava:~$ npusearch_check
[]
```

If the backends are down, try the following steps to bring them up.

#### 1. Restart using systemctl

Run `sudo systemctl restart npusearch.service`. Wait about 15 seconds, then run `npusearch_check` again.

#### 2. Check status using systemctl

Run `sudo systemctl status npusearch.service`.

- If the last message it prints says it is satisfying a license, the service got stuck during startup. If your device has SmartSSDs, try `sudo systemctl restart mpd.service`.
- If it says "Failed to start NPUSearch search backends." and your device has SmartSSDs, try `sudo systemctl restart mpd.service`. If your device has Kuona cards, try `sudo insmod npusearch`. If that returns an error, try `sudo dpkg-reconfigure npusearch`.

Then repeat step 1.

#### 3. Check log messages

In `/opt/lrl/etc/npusearch.conf`, the line `export LOGFILE=path/to/logfile` sets the file that NPUSearch writes its logs to. If the line is commented out, uncomment it and set the path where the log file should be written. Run steps 1 and 2 again, then read the log to find where the problem occurs.

### Search performance is lower than expected

This is commonly caused by the SSDs overheating and throttling.

- In iDRAC, confirm that the minimum fan speed is set to 100%, and check that the server's inlet air temperature is not too high. If you change anything in this step, run the tests again.
- Inspect the `ssd_nvme_smart_log_data.ndjson` file to see how hot the SSDs are getting. Each line of that file holds the `nvme smart-log` output for an SSD at a given timestamp. The thermal test fails if any SSD reaches 349 K (about 76 °C), but some SSDs throttle performance before they get that hot. If desired, send the `ssd_nvme_smart_log_data.ndjson` and `npusearch_install.log` files to [support@lewis-rhodes.com](mailto:support@lewis-rhodes.com). LRL can do a detailed analysis to help determine whether throttling is happening.
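A small shell sketch can give a quick first look at `ssd_nvme_smart_log_data.ndjson` before sending it to LRL. It assumes each line is the JSON form of `nvme smart-log` (as printed by `nvme smart-log <device> -o json`), where the top-level `temperature` field is the composite temperature in Kelvin. The file written by the test may use different field names, so check one line first. The function names `max_temp_k` and `report` are hypothetical. The 349 K limit is the thermal-test threshold described above.

```bash
#!/usr/bin/env bash
# Sketch: find the hottest SSD reading in an nvme smart-log NDJSON file.
# ASSUMPTION: each line is `nvme smart-log -o json` output with a top-level
# "temperature" field in Kelvin. Adjust the pattern if the file differs.

THERMAL_LIMIT_K=349  # the thermal test fails at or above this temperature

# Print the highest "temperature" value (Kelvin) found in the file.
# Matching the quoted key "temperature" skips per-sensor keys such as
# "temperature_sensor_1".
max_temp_k() {
  grep -oE '"temperature"[[:space:]]*:[[:space:]]*[0-9]+' "$1" \
    | grep -oE '[0-9]+$' \
    | sort -n \
    | tail -n 1
}

# Report the maximum temperature and return 1 if it reaches the limit,
# or 2 if no readings were found.
report() {
  local file=$1 max
  max=$(max_temp_k "$file")
  if [ -z "$max" ]; then
    echo "no temperature readings found in $file" >&2
    return 2
  fi
  # Integer approximation of Kelvin to Celsius (K - 273).
  echo "max temperature: ${max} K (~$((max - 273)) C)"
  if [ "$max" -ge "$THERMAL_LIMIT_K" ]; then
    echo "at or above ${THERMAL_LIMIT_K} K: the thermal test would fail"
    return 1
  fi
}
```

For example, after saving it as `check_temps.sh` (a hypothetical name), run `source check_temps.sh && report ssd_nvme_smart_log_data.ndjson`. A maximum just below 349 K does not rule out throttling, because some SSDs slow down at lower temperatures.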