The one thing that is causing me much grief is that the floppy drive works fine under 1.2.13, but loses characters under the 2.0 kernel. I'm in the process of compiling a minimal 2.0 kernel now (even turning off things that shouldn't have any effect). If that works I'll then do a binary search until I can find out what difference causes the floppy drive to have trouble.
Well, that didn't work; 2.0 had trouble with the floppy drive on that machine even in a minimal configuration. However, I noticed that 2.0.3 had been released, so I brought it over, compiled, installed and booted, and voila! Trouble-free floppy.
Howdy,
I recently loaded Red Hat 3.0.3 onto my Millennia TransPort laptop, and on top of that I installed the dozen or so RPMs to allow me to run the 2.0 kernel. When I started running 2.0 I found that the floppy drive wouldn't necessarily read properly -- no errors were reported, but data was coming off with bytes missing.
I tried doing a make config with a minimal setting (e.g. no network, no sound, etc., and with the floppy directly compiled in), but even the minimal version had this problem. I could reboot 1.2.13 and the floppy would work fine.
It was then time to sleep, so I put the laptop away, came back and decided to try 2.0.3. Yay! It worked, or so my test case suggested. I posted to comp.sys.laptops that 2.0.3 had solved my problem (I had mentioned the problem in relation to a query about Millennia TransPorts and Linux) and went about doing non-floppy stuff.
However, my test case really was just doing a
tar tfvz /dev/fd0
that had a raw .tar.gz catted to it from another machine and then seeing that tar didn't complain. I wasn't actually *using* the data. Later in the day when it came time to use the floppy I found that it was still fried. I could try it under 2.0.3 and it wouldn't work, then reboot 1.2.13 (without powering down or changing the floppy or running Windows or anything) and it would work, then I could reboot again to 2.0.3 and it would fail. That suggests that my problem isn't a flaky floppy drive, nor is it some weird configuration issue that was getting changed when I was power cycling or removing the floppy drive from its bay or anything weird like that.
dmesg doesn't show any error messages coming out when I try to access the floppy (unless I try to mount a filesystem in which case the filesystem mount code usually issues error messages since the filesystem will appear to be corrupted).
I'd be happy to run tests over here to help debug this, but I don't know much about Linux kernel debugging. It's literally been more than a decade since I used to get paid to create or hack up UNIX device drivers and the tools have changed greatly (gotten *MUCH* better).
--Cliff [Matthews]
ctm@ardi.com
>>>>> "Cliff" == Clifford T Matthews <ctm@ardi.com> writes:
Cliff> Howdy, I recently loaded Red Hat 3.0.3 onto my Millennia
Cliff> TransPort laptop, and on top of that I installed the dozen
Cliff> or so RPMs to allow me to run the 2.0 kernel. When I
Cliff> started running 2.0 I found that the floppy drive wouldn't
Cliff> necessarily read properly -- no errors were reported, but
Cliff> data was coming off with bytes missing.

After I made this post, Alain Knaff was kind enough to send me some email with suggestions. Reading his suggestions and rereading what I originally wrote leads me to think that some people may have dismissed this bug report as just a post from someone who didn't know how to use tar or know about the differences in line terminators of various OSes.
I replied to Alain and asked for permission to post my reply, since it contains more details of what's going wrong. He didn't mind, so here's some more information.
--Cliff
ctm@ardi.com
To: Alain.Knaff@imag.fr
Cc: ctm@ardi.com
Subject: Re: Millennia TransPort floppy problem with 2.0.0 and 2.0.3?
>>>>> "Alain" == Alain KNAFF <Alain.Knaff@imag.fr> writes:
[edited... -CRS]
Alain> Btw, the same thing applies to ftp, always say 'binary'
Alain> before transfers. Actually, in 99% of the cases, it works
Alain> anyways, even without binary, but in 1% it may break
Alain> down. Beware!

Yes, right. I'm sorry I didn't make it sufficiently clear what I meant.
Specifically, I was writing and reading directly to/from /dev/fd0. I had a floppy that I was able to do this:
boot 1.2.13
dd if=/dev/fd0 of=/tmp/floppy.good bs=4k count=1
boot 2.0.0
dd if=/dev/fd0 of=/tmp/floppy.bad bs=4k count=1
cmp -l /tmp/floppy.good /tmp/floppy.bad
and then eyeball the differences (which would start early) and see that occasionally bytes would be missing. NOTE: I didn't look beyond the first few bytes because I had an easy way to see corruption and I was more interested in finding out under which circumstances I'd see corruption and under which I wouldn't. I *never* saw corruption under 1.2.13. I always saw corruption under 2.0.0. Initially I didn't see corruption under 2.0.3, so I let my guard down, but later in the day (after I had switched the floppy for the battery and back and done a few other things as well) I saw corruption under 2.0.3 too. The strange thing is that I could then boot 1.2.13 and see no corruption, reboot 2.0.3 and see corruption.
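The comparison above can be sketched as a small shell function. This is only an illustration, not something from the original exchange: the name first_diff and its output words ("identical", "prefix") are made up here, and it assumes a cmp that supports -s and -l. It prints the 1-based offset of the first differing byte, which is the number that matters when bytes are being dropped, since everything after a dropped byte is shifted and will typically differ too.

```shell
#!/bin/sh
# first_diff GOOD BAD  (hypothetical helper, not part of any package)
# Prints "identical" if the two files match, the 1-based offset of the
# first differing byte if they don't, or "prefix" if one file is merely
# a truncated copy of the other.
first_diff() {
    if cmp -s "$1" "$2"; then
        echo "identical"
        return 0
    fi
    # cmp -l prints one line per differing byte: offset, then the two
    # byte values in octal. Only the first line is kept. cmp exits
    # non-zero when the files differ, hence the "|| true" guard.
    offset=$(cmp -l "$1" "$2" 2>/dev/null | awk 'NR == 1 { print $1 }') || true
    if [ -n "$offset" ]; then
        echo "$offset"
    else
        echo "prefix"
    fi
}
```

With the two dumps from the steps above, the call would be: first_diff /tmp/floppy.good /tmp/floppy.bad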
Today I'm not seeing corruption under 2.0.3, so it appears that the machine gets into a state where 1.2.13 will always work but 2.x will fail, but that the machine isn't necessarily in that state if it's been turned off for a while.
Cliff> :I tried doing a make config with a minimal setting (e.g. no network,
Cliff> :no sound, etc., and with the floppy directly compiled in), but even
Cliff> :the minimal version had this problem. I could reboot 1.2.13 and the
Cliff> :floppy would work fine.
Cliff> :
Cliff> :It was then time to sleep, so I put the laptop away, came back and
Cliff> :decided to try 2.0.3. Yay! It worked, or so my test case suggested.
Cliff> :I posted to comp.sys.laptops that 2.0.3 had solved my problem (I had
Cliff> :mentioned the problem in relation to a query about Millennia TransPorts
Cliff> :and Linux) and went about doing non-floppy stuff.
Cliff> :
Cliff> :However, my test case really was just doing a
Cliff> :
Cliff> :tar tfvz /dev/fd0
Cliff> :
Cliff> :that had a raw .tar.gz catted to it from another machine and then
Cliff> :seeing that tar didn't complain. I wasn't actually *using* the data.

Alain> If tar didn't complain when detarring (and most
Alain> importantly unzipping) that file, it _was_ probably ok. No
Alain> need to use it: the gzip algorithm is very sensitive to even
Alain> the tiniest perturbation, and if there were a problem, you
Alain> _would_ have seen it at that stage. If the _contents_ of
Alain> your tar file were corrupted, they probably became corrupted
Alain> before it was tarred up.

I think you're correct. Most likely when I did my first test the floppy drive *was* working correctly, like it is now. Something happened later in the day that resulted in the weirdness. I assume it will happen again and I'll investigate more (I had an important meeting to prepare for earlier so I couldn't do as much testing as I'd like).
Cliff> :Later in the day when it came time to use the floppy I found that it
Cliff> :was still fried.

Alain> Please be more precise. If you complain that it 'is still
Alain> fried' you might get responses like 'then put it into the
Alain> fridge' :-). What happened exactly (output...). What
Alain> happens with simple test cases such as
Alain> cat /dev/fd0 >imag.file. Do you get a file that is shorter
Alain> than 1474560 bytes, or what?

I never looked at the size since the corruption always happened so early into the floppy access and at the time I was primarily concerned with moving data around in preparation for the meeting. I did look at the first few blocks with cmp -l because I was curious. I didn't know the problem was going to be semi-intermittent so my curiosity was momentarily sated.
Cliff> : I could try it under 2.0.3 and it wouldn't work,
Cliff> :then reboot 1.2.13 (without powering down or changing the floppy or
Cliff> :running Windows or anything) and it would work, then I could reboot
Cliff> :again to 2.0.3 and it would fail. That suggests that my problem isn't
Cliff> :a flaky floppy drive, nor is it some weird configuration issue that
Cliff> :was getting changed when I was power cycling or removing the floppy
Cliff> :drive from its bay or anything weird like that.

Alain> This whole thing is extremely odd. Missing bytes (if
Alain> indeed that is what is going on) are a serious issue, and
Alain> would probably provoke a kernel panic right away. Indeed,
Alain> much kernel code heavily depends on the fact that a disk
Alain> sector is always 512 bytes, and things would break down
Alain> rather quickly if ever it turned out to be shorter.

Well, missing bytes is indeed what cmp -l was showing me. I haven't looked at the floppy driver code, but often block devices return how many blocks have been read, and they have no way of reporting that only 500 bytes of 512 have been transferred, with 12 lost, because that *shouldn't* happen. My guess is something weird is happening during DMA.
Cliff> :dmesg doesn't show any error messages coming out when I try to access
Cliff> :the floppy (unless I try to mount a filesystem in which case the
Cliff> :filesystem mount code usually issues error messages since the
Cliff> :filesystem will appear to be corrupted).
Cliff> :
Cliff> :I'd be happy to run tests over here to help debug this, but I don't
Cliff> :know much about Linux kernel debugging.

Alain> First, let's make sure we know exactly what happens. Work
Alain> from an ext2 partition on your hard disk (to make sure to
Alain> avoid binary translation problems on msdos, umsdos and vfat
Alain> filesystems). Make a file of exactly 1474560 characters
Alain> (you can do this by concatenating several large files,
Alain> until you have more than the requested amount, and then
Alain> trim it down to 1474560 characters by
Alain> dd if=file1 of=file2 bs=1024 count=1440). Then cat it to
Alain> floppy using a known working system (for example 1.2.13).
Alain> Reboot into the broken system (2.0.3), and cat it from the
Alain> floppy to another file. Check the size of the new file. If
Alain> the size is lots smaller than 1474560, check how many bytes
Alain> are actually missing. If the missing amount is rather
Alain> large, and a multiple of 512 bytes, your disk might be bad,
Alain> and there could be unreadable sectors. If the missing
Alain> amount is rather small, and not a multiple of 512 bytes,
Alain> then the bug really exists. Try
Alain> 'cmp -l oldfile newfile | head' to find out where exactly
Alain> the first byte is missing. It would be most interesting to
Alain> see whether this happens at a sector boundary, for example.
Alain> Knowing what value is missing might be interesting too:
Alain> 0x0a or 0x0d are very suspicious.

I don't need to do that, since I already determined that the missing bytes were very early into the data (within the first 20 bytes). I'll do more testing of my own devising when the laptop gets into this weird mode again.
When it was first happening I didn't realize it *was* a weird mode, since it looked to me that simply 1.2.13 worked and 2.0 didn't. It wasn't until I brought 2.0.3 over and found that it worked initially and then failed later in the day that I realized that an already bizarre problem was even more so (which isn't too surprising, since the obvious bugs get fixed almost immediately). In both cases when 2.0 and 2.0.3 were biting me, I'd reboot to 1.2.13 just to verify that 1.2.13 would always do the right thing, and it did.
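For reference, the size check in Alain's procedure quoted above can be sketched as a shell function. This is an illustration only: the name classify_deficit and its output strings are invented here, and it applies just the multiple-of-512 test. Alain's "rather large" versus "rather small" judgment is left to whoever reads the number.

```shell
#!/bin/sh
# classify_deficit EXPECTED_SIZE FILE  (hypothetical helper)
# Compares the size of FILE against EXPECTED_SIZE (1474560 for a
# 1.44 MB floppy image) and reports whether the shortfall is a whole
# number of 512-byte sectors.
classify_deficit() {
    expected=$1
    actual=$(wc -c < "$2" | tr -d ' ')
    missing=$((expected - actual))
    if [ "$missing" -eq 0 ]; then
        echo "same size"
    elif [ "$missing" -lt 0 ]; then
        echo "longer than expected"
    elif [ $((missing % 512)) -eq 0 ]; then
        # Whole sectors gone: points at unreadable sectors on the disk.
        echo "missing $missing bytes: whole sectors"
    else
        # Not a multiple of 512: bytes lost from inside a sector.
        echo "missing $missing bytes: partial sector"
    fi
}
```

After reading the floppy back into a file (newfile in Alain's example), the call would be: classify_deficit 1474560 newfile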
Alain> If on the other hand, both files have the same size, cmp -l
Alain> them to make sure that we don't have a different kind of
Alain> corruption.
Alain>
Alain> If they turn out to be equal, repeat the experiment at
Alain> different times of the day, under different patterns of
Alain> load, until something interesting turns up.

Cliff> : It's literally been more than
Cliff> :a decade since I used to get paid to create or hack up UNIX device
Cliff> :drivers and the tools have changed greatly (gotten *MUCH* better).
Cliff> :
Cliff> :--Cliff [Matthews]
Cliff> :ctm@ardi.com

Seriously, I did write UNIX device drivers back then. I added support for SCSI disconnect and reselect to a 680x0 based UNIX SCSI driver back in the days when Sun-2s still totally locked up every time you rewound the tape (because they didn't do disconnect and reselect). I may not know my way around the Linux kernel with my eyes shut, but I apologize for making my original bug report sound like I was just wet behind the ears and I couldn't get tar to work.