06 Jul 1996 13:18

The one thing that is causing me much grief is that the floppy drive works fine under 1.2.13, but loses characters under the 2.0 kernel. I'm in the process of compiling a minimal 2.0 kernel now (even turning off things that shouldn't have any effect). If that works I'll then do a binary search until I can find out what difference causes the floppy drive to have trouble.

07 Jul 1996 08:33

Well, that didn't work, 2.0 had trouble with the floppy drive on that machine even in a minimal configuration. However, I noticed that 2.0.3 had been released, so I brought it over, compiled, installed and booted, and voila! Trouble-free floppy.

07 Jul 1996 18:13

Howdy,

I recently loaded Red Hat 3.0.3 onto my Millennia TransPort laptop, and on top of that I installed the dozen or so RPMs to allow me to run the 2.0 kernel. When I started running 2.0 I found that the floppy drive wouldn't necessarily read properly -- no errors were reported, but data was coming off with bytes missing.

I tried doing a make config with a minimal setting (e.g. no network, no sound, etc., and with the floppy directly compiled in), but even the minimal version had this problem. I could reboot 1.2.13 and the floppy would work fine.

It was then time to sleep, so I put the laptop away, came back and decided to try 2.0.3. Yay! It worked, or so my test case suggested. I posted to comp.sys.laptops that 2.0.3 had solved my problem (I had mentioned the problem in relation to a query about Millennia TransPorts and Linux) and went about doing non-floppy stuff.

However, my test case really was just doing a

tar tfvz /dev/fd0

on a floppy that had a raw .tar.gz catted to it from another machine, and then seeing that tar didn't complain. I wasn't actually *using* the data. Later in the day, when it came time to use the floppy, I found that it was still fried. I could try it under 2.0.3 and it wouldn't work, then reboot 1.2.13 (without powering down or changing the floppy or running Windows or anything) and it would work, then I could reboot again to 2.0.3 and it would fail. That suggests that my problem isn't a flakey floppy drive, nor is it some weird configuration issue that was getting changed when I was power cycling or removing the floppy drive from its bay or anything weird like that.
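
A stricter check than listing the archive would be to compare a checksum of what was written with a checksum of what comes back. The following is only a sketch and not something that was run on the laptop: readback_matches is a made-up helper name, and it assumes the archive was catted raw to the start of the device, so only the first bytes of the device (as many as the archive is long) are compared.

```shell
#!/bin/sh
# readback_matches SRC DEV   (hypothetical helper, not from the thread)
# Succeeds if the first N bytes read from DEV have the same POSIX cksum
# as SRC, where N is the size of SRC. DEV may be /dev/fd0 or a plain file.
readback_matches() {
    src=$1
    dev=$2
    size=$(wc -c < "$src" | tr -d ' ')
    want=$(cksum < "$src")                 # "crc length" of the original
    got=$(head -c "$size" "$dev" | cksum)  # same, for the data read back
    [ "$want" = "$got" ]
}

# Intended use (device access left commented out):
#   cat archive.tar.gz > /dev/fd0
#   readback_matches archive.tar.gz /dev/fd0 && echo "read back OK"
```

Because cksum also reports the byte count, a dropped byte fails this check even if the rest of the data happens to line up.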

dmesg doesn't show any error messages coming out when I try to access the floppy (unless I try to mount a filesystem in which case the filesystem mount code usually issues error messages since the filesystem will appear to be corrupted).

I'd be happy to run tests over here to help debug this, but I don't know much about Linux kernel debugging. It's literally been more than a decade since I used to get paid to create or hack up UNIX device drivers and the tools have changed greatly (gotten *MUCH* better).

--Cliff [Matthews]
ctm@ardi.com

08 Jul 1996 16:16

>>>>> "Cliff" == Clifford T Matthews <ctm@ardi.com> writes:

    Cliff> Howdy, I recently loaded Red Hat 3.0.3 onto my Millennia
    Cliff> TransPort laptop, and on top of that I installed the dozen
    Cliff> or so RPMs to allow me to run the 2.0 kernel.  When I
    Cliff> started running 2.0 I found that the floppy drive wouldn't
    Cliff> necessarily read properly -- no errors were reported, but
    Cliff> data was coming off with bytes missing.

After I made this post, Alain Knaff was kind enough to send me some email with suggestions. Reading his suggestions and rereading what I originally wrote leads me to think that some people may have dismissed this bug report as just a post from someone who didn't know how to use tar or know about the differences in line terminators of various OSes.

I replied to Alain and asked for permission to post my reply, since it contains more details of what's going wrong. He didn't mind, so here's some more information.

--Cliff
ctm@ardi.com

To: Alain.Knaff@imag.fr
Cc: ctm@ardi.com
Subject: Re: Millennia TransPort floppy problem with 2.0.0 and 2.0.3?

>>>>> "Alain" == Alain KNAFF <Alain.Knaff@imag.fr> writes:
[edited... -CRS]

    Alain>  Btw, the same thing applies to ftp, always say 'binary'
    Alain> before transfers. Actually, in 99% of the cases, it works
    Alain> anyways, even without binary, but in 1% it may break
    Alain> down. Beware!

Yes, right. I'm sorry I didn't make it sufficiently clear what I meant.

Specifically, I was writing and reading directly to/from /dev/fd0. I had a floppy with which I was able to do this:

boot 1.2.13

dd if=/dev/fd0 of=/tmp/floppy.good bs=4k count=1

boot 2.0.0

dd if=/dev/fd0 of=/tmp/floppy.bad bs=4k count=1

cmp -l /tmp/floppy.good /tmp/floppy.bad

and then eyeball the differences (which would start early) and see that occasionally bytes would be missing. NOTE: I didn't look beyond the first few bytes because I had an easy way to see corruption and I was more interested in finding out under which circumstances I'd see corruption and under which I wouldn't. I *never* saw corruption under 1.2.13. I always saw corruption under 2.0.0. Initially I didn't see corruption under 2.0.3, so I let my guard down, but later in the day (after I had switched the floppy for the battery and back and done a few other things as well), I saw corruption under 2.0.3 too. The strange thing is that I could then boot 1.2.13 and see no corruption, reboot 2.0.3 and see corruption.
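
Instead of eyeballing the cmp -l output, the first differing offset and its position within a 512-byte sector can be pulled out mechanically. This is a minimal sketch under stated assumptions: first_diff is a hypothetical helper name, and the 512-byte sector size is the usual one for floppies rather than something measured here.

```shell
#!/bin/sh
# first_diff GOOD BAD   (hypothetical helper, not from the thread)
# Prints "offset sector offset_in_sector" (all 0-based) for the first
# differing byte. Prints nothing if the files are identical or if one
# is merely a truncated copy of the other.
first_diff() {
    cmp -l "$1" "$2" 2>/dev/null | head -n 1 | awk '{
        off = $1 - 1                      # cmp -l counts bytes from 1
        printf "%d %d %d\n", off, int(off / 512), off % 512
    }'
}
```

With the two dumps above this would be run as first_diff /tmp/floppy.good /tmp/floppy.bad.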

Today I'm not seeing corruption under 2.0.3, so it appears that the machine gets into a state where 1.2.13 will always work but 2.x will fail, but that the machine isn't necessarily in that state if it's been turned off for a while.

    Cliff> I tried doing a make config with a minimal setting
    Cliff> (e.g. no network, no sound, etc., and with the floppy
    Cliff> directly compiled in), but even the minimal version had
    Cliff> this problem.  I could reboot 1.2.13 and the floppy would
    Cliff> work fine.  It was then time to sleep, so I put the
    Cliff> laptop away, came back and decided to try 2.0.3.  Yay!  It
    Cliff> worked, or so my test case suggested.  I posted to
    Cliff> comp.sys.laptops that 2.0.3 had solved my problem (I had
    Cliff> mentioned the problem in relation to a query about
    Cliff> Millennia TransPorts and Linux) and went about doing
    Cliff> non-floppy stuff.  However, my test case really was just
    Cliff> doing a tar tfvz /dev/fd0 that had a raw .tar.gz
    Cliff> catted to it from another machine and then seeing that tar
    Cliff> didn't complain.  I wasn't actually *using* the data.

    Alain>  If tar didn't complain when detarring (and most
    Alain> importantly unzipping) that file, it _was_ probably ok. No
    Alain> need to use it: the gzip algorithm is very sensitive to even
    Alain> the tiniest perturbation, and if there were a problem, you
    Alain> _would_ have seen it at that stage. If the _contents_ of
    Alain> your tar file was corrupted, it probably became corrupted
    Alain> before it was tarred up.

I think you're correct. Most likely when I did my first test the floppy drive *was* working correctly, like it is now. Something happened later in the day that resulted in the weirdness. I assume it will happen again and I'll investigate more (I had an important meeting to prepare for earlier so I couldn't do as much testing as I'd like).

    Cliff> Later in the day when it came time to use the floppy I
    Cliff> found that it was still fried.

    Alain>  Please be more precise. If you complain that it 'is still
    Alain> fried' you might get responses like 'then put it into the
    Alain> fridge' :-). What happened exactly (output...). What
    Alain> happens with simple test cases such as cat /dev/fd0
    Alain> >imag.file. Do you get a file that is shorter than 1474560
    Alain> bytes, or what?

I never looked at the size since the corruption always happened so early into the floppy access and at the time I was primarily concerned with moving data around in preparation for the meeting. I did look at the first few blocks with cmp -l because I was curious. I didn't know the problem was going to be semi-intermittent so my curiosity was momentarily sated.

    Cliff> I could try it under 2.0.3 and it wouldn't work, then
    Cliff> reboot 1.2.13 (without powering down or changing the floppy
    Cliff> or running Windows or anything) and it would work, then I
    Cliff> could reboot again to 2.0.3 and it would fail.  That
    Cliff> suggests that my problem isn't a flakey floppy drive, nor
    Cliff> is it some weird configuration issue that was getting
    Cliff> changed when I was power cycling or removing the floppy
    Cliff> drive from its bay or anything weird like that.

    Alain>  This whole thing is extremely odd. Missing bytes (if
    Alain> indeed that is what is going on) are a serious issue, and
    Alain> would probably provoke a kernel panic right away. Indeed,
    Alain> much kernel code heavily depends on the fact that a disk
    Alain> sector is always 512 bytes, and things would break down
    Alain> rather quickly if ever it turned out to be shorter.

Well, missing bytes are indeed what cmp -l was showing me. I haven't looked at the floppy driver code, but block device drivers often report only how many blocks have been read, and have no way of reporting that 500 bytes of 512 were transferred, with 12 lost, because that *shouldn't* happen. My guess is something weird is happening during DMA.
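
One way to test the missing-bytes reading of the cmp -l output: if a byte was dropped rather than altered, the bad dump from the first differing offset onward should equal the good dump from one byte later. The sketch below checks exactly that over a short window. The helper name dropped_byte_at and the 64-byte default window are assumptions for illustration, and a second drop inside the window would make the check fail.

```shell
#!/bin/sh
# dropped_byte_at GOOD BAD OFFSET [WINDOW]   (hypothetical helper)
# OFFSET is the 0-based offset of the first differing byte.
# Succeeds if the WINDOW bytes of BAD starting at OFFSET equal the
# bytes of GOOD starting at OFFSET+1, i.e. GOOD's byte at OFFSET was
# dropped and everything after it slid down by one.
dropped_byte_at() {
    good=$1
    bad=$2
    off=$3
    win=${4:-64}
    # tail -c +N starts output at byte N, counting from 1.
    a=$(tail -c +"$((off + 2))" "$good" | head -c "$win" | od -An -tx1)
    b=$(tail -c +"$((off + 1))" "$bad" | head -c "$win" | od -An -tx1)
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

A byte that was merely altered in place fails this test, since the data after it is not shifted.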

    Cliff> dmesg doesn't show any error messages coming out when I
    Cliff> try to access the floppy (unless I try to mount a
    Cliff> filesystem in which case the filesystem mount code usually
    Cliff> issues error messages since the filesystem will appear to
    Cliff> be corrupted).  I'd be happy to run tests over here to
    Cliff> help debug this, but I don't know much about Linux kernel
    Cliff> debugging.

    Alain>  First, let's make sure we know exactly what happens. Work
    Alain> from an ext2 partition on your hard disk (to make sure to
    Alain> avoid binary translation problems on msdos, umsdos and vfat
    Alain> filesystems). Make a file of exactly 1474560 characters
    Alain> (you can do this by concatenating several large files,
    Alain> until you have more than the requested amount, and then
    Alain> trim it down to 1474560 characters by dd if=file1 of=file2
    Alain> bs=1024 count=1440). Then cat it to floppy using a known
    Alain> working system (for example 1.2.13). Reboot into the broken
    Alain> system (2.0.3), and cat it from the floppy to another
    Alain> file. Check the size of the new file. If the size is lots
    Alain> smaller than 1474560, check how many bytes are actually
    Alain> missing. If the missing amount is rather large, and a
    Alain> multiple of 512 bytes, your disk might be bad, and there
    Alain> could be unreadable sectors. If the missing amount is
    Alain> rather small, and not a multiple of 512 bytes, then the bug
    Alain> really exists. Try 'cmp -l oldfile newfile | head' to find
    Alain> out where exactly the first byte is missing.  It would be
    Alain> most interesting to see whether this happens at a sector
    Alain> boundary, for example. Knowing what value is missing might
    Alain> be interesting too; 0x0a or 0x0d are very suspicious.

I don't need to do that, since I already determined that the missing bytes were very early into the data (within the first 20 bytes). I'll do more testing of my own devising when the laptop gets into this weird mode again. When it was first happening I didn't realize it *was* a weird mode, since it looked to me as if 1.2.13 simply worked and 2.0 didn't. It wasn't until I brought 2.0.3 over and found that it worked initially and then failed later in the day that I realized that an already bizarre problem was even more so (which isn't too surprising, since the obvious bugs get fixed near immediately). In both cases when 2.0 and 2.0.3 were biting me, I'd reboot to 1.2.13 just to verify that 1.2.13 would always do the right thing, and it did.

    Alain> If on the other hand, both files have the same size, cmp -l
    Alain> them to make sure that we don't have a different kind of
    Alain> corruption.

    Alain> If they turn out to be equal, repeat the experiment at
    Alain> different times of the day, under different patterns of
    Alain> load, until something interesting turns up.
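
The size check in the procedure above can be sketched as a small script. This is a sketch under stated assumptions, not a tool from the thread: classify_size and size_of are made-up names, /dev/urandom is assumed as a convenient source of test data in place of concatenated files, and the script looks only at whether the shortfall is a multiple of 512, not at whether it is "rather large" or "rather small".

```shell
#!/bin/sh
# Names below are illustrative, not from the thread.
EXPECTED=1474560   # bytes on a 1.44 MB floppy (1440 * 1024), as in the text

# size_of FILE: print FILE's size in bytes.
size_of() {
    wc -c < "$1" | tr -d ' '
}

# classify_size ACTUAL: interpret the size of the file read back.
classify_size() {
    missing=$((EXPECTED - $1))
    if [ "$missing" -eq 0 ]; then
        echo "same size: run cmp -l to look for other corruption"
    elif [ "$missing" -lt 0 ]; then
        echo "file is larger than expected: check the procedure"
    elif [ $((missing % 512)) -eq 0 ]; then
        echo "missing $missing bytes, whole sectors: suspect bad media"
    else
        echo "missing $missing bytes, not whole sectors: suspect the driver"
    fi
}

# Intended use (device access left commented out):
#   dd if=/dev/urandom of=ref.img bs=1024 count=1440   # assumed data source
#   cat ref.img > /dev/fd0          # under the known-good kernel
#   cat /dev/fd0 > new.img          # under the suspect kernel
#   classify_size "$(size_of new.img)"
#   cmp -l ref.img new.img | head
```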

    Cliff> It's literally been more than a decade since I used to
    Cliff> get paid to create or hack up UNIX device drivers and the
    Cliff> tools have changed greatly (gotten *MUCH* better).
    Cliff> --Cliff [Matthews] ctm@ardi.com

Seriously, I did write UNIX device drivers back then. I added support for SCSI disconnect and reselect to a 680x0-based UNIX SCSI driver back in the days when Sun-2s still totally locked up every time you rewound the tape (because they didn't do disconnect and reselect). I may not know my way around the Linux kernel with my eyes shut, but I apologize for making my original bug report sound like I was just wet behind the ears and I couldn't get tar to work.