[solved] Linux software RAID-1 fallover problem.

Note: this is a partial post. I may fill in more information at a later time. However, the solution is stated at the bottom and may be helpful already.

The Problem history

A few days ago I set up a rsnapshot backup solution and when i ran it for the first time, my linux software raid-1 (2x 3TB – 6GB/s SATA) fell over. It wasn’t tragic, since after a reboot it simply re-synced. However, this happened reliably every time I did a backup and then the re-syncs started to fail as well.

Turns out my motherboard (Asus P7H57D-V EVO) is using a Marvell 88SE6111 chip for the two SATA 6GB/s SATA connectors it provides and which I am using for the two disks in question. All my other disks are attached to an intel chip that only offers SATA 3GB/s connections. This Marvell chip seems to have poor support under linux. However, the controller is set in the BIOS to work via AHCI and the disks do work under normal conditions without a problem.

The current kernel is stated below:

Linux 3.10.7-1-ARCH x86_64 GNU/Linux

I checked the disks (both are new) using smarctctl and found no issues. When the raid was synced, everything worked fine and the smartctl command showed good new disks. Once the backup had toppled the raid or the re-sync failed, smartctl failed with the following errors shown in dmesg:

[ 7483.951154] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[ 7483.951179] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

Checking dmesg for a hint on why the disks/raid failed, I found the following errors (the relevant hard drives are /dev/sdf and /dev/sde. (… marks sections where I cut repetitions):

[ 1128.512259] ata7.00: exception Emask 0x0 SAct 0xffff SErr 0x0 action 0x6 frozen
[ 1128.512268] ata7.00: failed command: READ FPDMA QUEUED
[ 1128.512277] ata7.00: cmd 60/00:00:80:67:95/01:00:0f:00:00/40 tag 0 ncq 131072 in
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1128.512281] ata7.00: status: { DRDY }
...
[ 1128.512481] ata7.00: failed command: READ FPDMA QUEUED
[ 1128.512488] ata7.00: cmd 60/80:78:00:5e:95/00:00:0f:00:00/40 tag 15 ncq 65536 in
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1128.512492] ata7.00: status: { DRDY }
[ 1128.512499] ata7: hard resetting link
[ 1129.002169] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 370)
[ 1129.003682] ata7.00: configured for UDMA/133
[ 1129.003793] ata7.00: device reported invalid CHS sector 0
[ 1129.003798] ata7.00: device reported invalid CHS sector 0
...
[ 1129.003905] ata7: EH complete
[ 1611.769630] ACPI: EC: GPE storm detected(10 GPEs), transactions will use polling mode
[ 2161.457698] ata8.00: exception Emask 0x0 SAct 0x7ff SErr 0x0 action 0x6 frozen
[ 2161.457709] ata8.00: failed command: WRITE FPDMA QUEUED
[ 2161.457718] ata8.00: cmd 61/00:00:80:c4:2c/02:00:1e:00:00/40 tag 0 ncq 262144 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 2161.457723] ata8.00: status: { DRDY }
...
[ 5628.308982] ata8.00: failed command: WRITE FPDMA QUEUED
[ 5628.308990] ata8.00: cmd 61/80:50:80:34:44/01:00:50:00:00/40 tag 10 ncq 196608 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 5628.308993] ata8.00: status: { DRDY }
[ 5628.309000] ata8: hard resetting link
[ 5638.311674] ata8: softreset failed (1st FIS failed)
[ 5638.311686] ata8: hard resetting link
[ 5648.314623] ata8: softreset failed (1st FIS failed)
[ 5648.314633] ata8: hard resetting link
[ 5683.316659] ata8: softreset failed (1st FIS failed)
[ 5683.316671] ata8: limiting SATA link speed to 3.0 Gbps
[ 5683.316675] ata8: hard resetting link
[ 5688.319776] ata8: softreset failed (1st FIS failed)
[ 5688.319788] ata8: reset failed, giving up
[ 5688.319792] ata8.00: disabled
[ 5688.319798] ata8.00: device reported invalid CHS sector 0
[ 5688.319802] ata8.00: device reported invalid CHS sector 0
[ 5688.319805] ata8.00: device reported invalid CHS sector 0
[ 5688.319808] ata8.00: device reported invalid CHS sector 0
[ 5688.319811] ata8.00: device reported invalid CHS sector 0
[ 5688.319814] ata8.00: device reported invalid CHS sector 0
[ 5688.319816] ata8.00: device reported invalid CHS sector 0
[ 5688.319819] ata8.00: device reported invalid CHS sector 0
[ 5688.319822] ata8.00: device reported invalid CHS sector 0
[ 5688.319825] ata8.00: device reported invalid CHS sector 0
[ 5688.319828] ata8.00: device reported invalid CHS sector 0
[ 5688.319897] ata8: EH complete
[ 5688.319932] sd 7:0:0:0: [sdf] Unhandled error code
[ 5688.319936] sd 7:0:0:0: [sdf]
[ 5688.319939] Result: hostbyte=0x04 driverbyte=0x00
[ 5688.319942] sd 7:0:0:0: [sdf] CDB:
[ 5688.319944] cdb[0]=0x8a: 8a 00 00 00 00 00 50 44 34 80 00 00 01 80 00 00
[ 5688.319960] end_request: I/O error, dev sdf, sector 1346647168
[ 5688.320033] sd 7:0:0:0: [sdf] Unhandled error code
[ 5688.320037] sd 7:0:0:0: [sdf]
[ 5688.320040] md/raid1:md3: Disk failure on sdf1, disabling device.
md/raid1:md3: Operation continuing on 1 devices.
[ 5688.320045] Result: hostbyte=0x04 driverbyte=0x00
[ 5688.320048] sd 7:0:0:0: [sdf] CDB:
[ 5688.320049] cdb[0]=0x8a: 8a 00 00 00 00 00 50 44 32 00 00 00 02 80 00 00
[ 5688.320063] end_request: I/O error, dev sdf, sector 1346646528
[ 5688.320076] sd 7:0:0:0: [sdf] Unhandled error code
[ 5688.320079] sd 7:0:0:0: [sdf]
[ 5688.320081] Result: hostbyte=0x04 driverbyte=0x00
[ 5688.320084] sd 7:0:0:0: [sdf] CDB:
...
[ 5688.320349] cdb[0]=0x8a: 8a 00 00 00 00 00 50 44 26 80 00 00 02 80 00 00
[ 5749.344064] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 5749.344075] ata7.00: failed command: FLUSH CACHE EXT
[ 5749.344087] ata7.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 5749.344092] ata7.00: status: { DRDY }
[ 5749.344101] ata7: hard resetting link
[ 5759.350285] ata7: softreset failed (1st FIS failed)
[ 5759.350296] ata7: hard resetting link
[ 5769.353267] ata7: softreset failed (1st FIS failed)
[ 5769.353277] ata7: hard resetting link
[ 5804.351941] ata7: softreset failed (1st FIS failed)
[ 5804.351953] ata7: limiting SATA link speed to 3.0 Gbps
[ 5804.351957] ata7: hard resetting link
[ 5809.351710] ata7: softreset failed (1st FIS failed)
[ 5809.351723] ata7: reset failed, giving up
[ 5809.351727] ata7.00: disabled
[ 5809.351733] ata7.00: device reported invalid CHS sector 0
[ 5809.351759] ata7: EH complete
[ 5809.351815] sd 6:0:0:0: [sde] Unhandled error code
[ 5809.351820] sd 6:0:0:0: [sde]
[ 5809.351822] Result: hostbyte=0x04 driverbyte=0x00
[ 5809.351826] sd 6:0:0:0: [sde] CDB:
[ 5809.351828] cdb[0]=0x8a: 8a 00 00 00 00 00 00 00 08 08 00 00 00 01 00 00
[ 5809.351844] blk_update_request: 1 callbacks suppressed
[ 5809.351847] end_request: I/O error, dev sde, sector 2056
[ 5809.351852] end_request: I/O error, dev sde, sector 2056
[ 5809.351855] md: super_written gets error=-5, uptodate=0
[ 5809.352927] md: checkpointing resync of md3.
[ 5809.352987] sd 6:0:0:0: [sde] Unhandled error code
[ 5809.352993] sd 6:0:0:0: [sde]
[ 5809.352997] Result: hostbyte=0x04 driverbyte=0x00
[ 5809.353002] sd 6:0:0:0: [sde] CDB:
[ 5809.353005] cdb[0]=0x8a: 8a 00 00 00 00 00 00 00 08 08 00 00 00 01 00 00
[ 5809.353027] end_request: I/O error, dev sde, sector 2056
[ 5809.353032] end_request: I/O error, dev sde, sector 2056
[ 5809.353036] md: super_written gets error=-5, uptodate=0
[ 5809.353140] sd 6:0:0:0: [sde] Unhandled error code
[ 5809.353146] sd 6:0:0:0: [sde]
[ 5809.353150] Result: hostbyte=0x04 driverbyte=0x00
[ 5809.353155] sd 6:0:0:0: [sde] CDB:
...

The way I read this was as follows. The physical symptoms are that the raid stops to respond for about 30 seconds, and then works again. These freeze problems then lead to a timeout on the rsync backup or the re-sync action. The cause for the freeze may ultimately be the Marvell controller (since my software raid-5 with 3 disks on the Intel chip has no such problems), but it may well be due to an overload during backup / sync_action.

My solution to the problem

Since I could not find a better driver for the Marvell chip, and since hours of research on the web did not yield a better solution, I decided to prevent the cause instead of curing it. It seems that the software raid only falls over when it is under immense load, such as during a re-sync or when the rsnapshot tool is continuously spitting data at the raid. By setting bandwidth limits for the mdadm sync_action and the rsync tool underlying the rsnapshot software, I was able to prevent the problem from occurring again.

Setting a bandwidth limit on the mdadm re-sync action as described here:

sysctl -w dev.raid.speed_limit_max=value

I chose the value of 124000, which is slightly lower than the speed stated by mdstat during successful syncing i.e. before the re-sync action overloaded the raid. To get that value, type the following while the raid is happily re-syncing:

$ cat /proc/mdstat

Check the current limits for the resync action:

# sysctl dev.raid.speed_limit_min
# sysctl dev.raid.speed_limit_max

To permanently override the default, I added this line to /etc/sysctl.conf:

## New max limit to prevent raid failure
dev.raid.speed_limit_max = value

To prevent rsnapshot from overloading the raid during a backup, I set the bwlimit parameter in the rsnapshot config file at /etc/rsnapshot.conf. As you can see, I also set a 3min timout to give the raid some time to unfreeze, should that happen again:

# Default rsync args. All rsync commands have at least these options set.
rsync_long_args	--delete --numeric-ids --relative --delete-excluded --timeout=180 --bwlimit=12000

I have since run a backup and found no problems.

Ammendment (2013-08-30)

While the raid system has not fallen over since I implemented the fix described above, the underlying problem with the hdd-freeze and timeout issues has not been resolved!!

Installing the Canon PIXMA iP4500 printer on 64bit Linux – openSUSE 10.3

Ok, so this wasn’t too hard, but there was a little hiccup and so I thought I’d briefly note down the solution.

The first step is to find the actual driver because the iP4500 model not listed in the standard set of drivers (at least not in mine). Choosing one of the other PIXMA drivers (like the 4100) will only result in blanc pages.

1) Find the drivers on the Canon website (google search for it. I would post the link but the page is currently not working for me, it did so a minute ago though).
There are two files, one is a common driver (file name: cnijfilter-common-2.80-1.i386.rpm) and the other one is the model specific driver (file name: cnijfilter-ip4500series-2.80-1.i386.rpm). They need to be installed in order, so I suggest you also download the instructions, a file called: guideip4500series-pd-2.80-1.tar.tar
The instructions are very clear and come in flavours for Fedora 7, openSUSE 10.3 and Ubuntu 7.04 (all the same file).

2) After the installation I tried to print but without success. Looking at the printer properties (Control Center – Printer – <printer name> (right click) -> Properties) I found the error message:

 /usr/lib64/cups/backend/cnij_usb failed

I did a google search, and the solution is the following (as found on Linuxuser LigLog).

Open a terminal and as root, type the following two commands:

ln /usr/lib/cups/backend/cnij_usb /usr/lib64/cups/backend/cnij_usb
ln /usr/lib/cups/filter/pstocanonij /usr/lib64/cups/filter/pstocanonij

Restart CUPS: /etc/init.d/cups restart and it should work.

How to access multiple GTK+ widgets using Glib and the GObject

See my former post on “How to access two GTK+ widgets from a single callback function (using libglade)” for a different way of achieving the same thing. However, the method described below is the official way of doing things and thus to be preferred.

GtkWidgets are derived from GObjects. They can be cast back and forth using the G_OBJECT() and GTK_WIDGET() macros. Use g_object_set_data() to attach a gpointer and its key to a GObject of your choice (find the definition at library.gnome.org). The pointers can then be retrieved wherever the GObject that they have been attached to, is available. This is done using g_object_get_data(). Attaching two pointers to two different GtkWidgets to the main window widget, will enable you to call on them from any callback function you use.

How to access two GTK+ widgets from a single callback function (using libglade).

I hope the code below may be useful to some people. I am new to GTK+ myself so I am not sure whether this is the best solution, but it is certainly on of the better ones. Other solutions would be to use globals or possibly structs.

What the function does:
It is a simple callback function that has been connected to the GladeXML file by the following line:
glade_xml_signal_connect(gxml, "on_execute_recording_button_clicked", G_CALLBACK(on_execute_recording_button_clicked));
where gxml is the main GladeXML file that was created using Glade-3. Look on Micah Carrick’s blog for information and tutorials on Glade.

Using glade_get_widget_tree() we can get hold of the GladeXML file inside the callback function without using globals. From there is it easy, we simply use the glade_xml_get_widget() function to create the widgets we want.

Here is the code:

/*
 * Function aquires a handle on the GtkProgressBar and the GtkTreeView widget
 * then calls a function that works on both widgets.
 *
*/
void on_execute_recording_button_clicked  (GtkButton *button, GtkWidget *window, gpointer gptr)
{
	GladeXML *gxml;
	GtkWidget  *pbarWidget, *data_viewWidget;
	GtkProgressBar *pbar;
	GtkTreeView *data_view;

	gxml = glade_get_widget_tree(GTK_WIDGET(button));

	pbarWidget = glade_xml_get_widget (gxml, "record_progressbar");
	pbar =  GTK_PROGRESS_BAR(pbarWidget);

	data_viewWidget = glade_xml_get_widget (gxml, "data_view");
	data_view = GTK_TREE_VIEW(data_viewWidget);

	get_data(pbar, data_view);
}

Solution to the “Aborted” error on openSUSE linux for programs using the National Instruments GPIB (IEEE-488.2) controller

The problem:
I am using an interface from National Instruments for the General Purpose Instrumentation Bus (GPIB) and when I tried to execute the program in the terminal, I only got an “Aborted” message back. The program had been working before and still compiled without errors or warnings but it would not execute. The program terminated imediatey after it was called, before even entering main().

The solution:

The problem lies with the ni-488.2 driver and may be solved by upating the driver. Every kernel update will result in the “Aborted” error reappearing. Every kernel update thus requires an update of the driver if you want to use the ni488 interface in your software.

The steps to take are similar to the initial installation. The following steps are quoted directly from the NI 488.2 README file:

1. Ensure you have installed kernel-source and kernel-syms packages
corresponding to the version of the currently running kernel. You can
determine the currently running kernel version by issuing the command
uname -r.
2. Change the directory to the /usr/src/linux-<version>directory, where
<version> corresponds to the currently running kernel version.
3. Run make cloneconfig as root to configure the sources for the
currently running kernel.
4. Run make modules_prepare as root to prepare the headers for
compilation.

Now go to /usr/local/bin and run the “updateNIDrivers” utility as root. That should update your NI driver and remove the problem.

Installing the Japanese language pack for Acrobat Reader (8) on OpenSUSE Linux

Today, while reading up on possible suppliers of HV-supplies, my Adobe Reader complained that it could not read part of a document without the Japanese language extension pack. Very conveniently the pop-up warning message contains the address of the adobe website where one can download the language extension for free (make sure you get the correct file for your version of acrobat reader):
http://www.adobe.com/products/acrobat/acrrasianfontpack.html

Its a simple matter of unpacking the tar.gz file, entering the created folder (“JPNKIT” in my case) in a terminal window, switching to root and typing: ./INSTALL
However, as part of the installation procedure, the install script asks for the correct location of your Adobe Reader installation. Now maybe it’s because I am not a linux Pro, but I did not know which directory the script was asking for. It was not the /usr/bin/acroread path. After a very unlucky search on google for the correct path, I finally figured it out. The path to use was the following:
/usr/lib
The installer finished and I can go on with my work. Maybe someone out there will find this post helpful.