evb2: ace control panel
evb2: write documentation
l3: pager, maintenance, etc training
qsub_fcdflnx
l3 kill node
l3 panel: check farm
new perspectives talk
l3 monitoring: hyperthreading
ace shifts
mit qft
2005.06.15
Wishlist fulfilled, with possible exception of first item, which can be discussed later today. Moving on to more ACP fun...
2005.06.15
Wishlist from Markus:
2005.06.14
Fixed the Check EVB Proxy and Check SCPUs buttons. The latter now runs a shell script built on rgang instead of the old expect script.
2005.06.13
One more entry for this productive day. I thought it might be of use to include the definitions of messages that Steve mentioned, so as to have them visible from this page:
The new event builder will not need many of the functions now present in the Ace Control panel application; truth to tell, the old EVB didn't need most of these either.
Khaldoun can start preparing a new control panel by eliminating the GUI widgets used to ask for obsolete functions as well as the function calls themselves. The step after that is to re-implement the remaining functions using hand-written Daqmsg classes instead of the classes now provided by the Scanner Manager or SCPU code packages.
Scanner Manager control/monitoring
----------------------------------
The remote procedure call interface obeyed by the Scanner Manager is described in the file $CDFEVB_SM_DIR/src/control/Sm.idl (setup cdfevb_sm). The name of the interface is "Control", Java package cdfevb.Sm.
The following remote procedure calls are no longer needed:
areYouOk          Pings the SM.
begin             Starts the SM application.
kill              Stops the SM application.
reboot
dump
getPartStatus     Find the state of a partition.
putConfig         Insert a new VRB configuration
                  (slot numbers, channel masks).
getInternalState  Show a lot of VxWorks-related stuff.
recover           Force a recover.
We keep the following:
getStats          Really detailed statistics.
getShortStats     Really brief - just one counter.
resetStats        Set all stats to zero.
getConfig         Retrieve VRB configuration.
Changes to the data returned by the operations we keep. Name and value
changes may require changes in the display GUI.
struct ScpuStats
----id is now the SCPU's permanent subsystem ID.
struct BuilderStats is no longer needed
struct TeamStats will be renamed to SubfarmStats
----teamId becomes subfarmId
----bldReady becomes subfarmReady
----bld is omitted
struct PartStats
----partNum will be in the range 0 to 7, not 1 to 8
----loading, waiting, etc., will be in units of millisec
----bldReady becomes subfarmReady
struct SmStats becomes ProxyStats
----loading, etc., will be in msec
----bldReady becomes subfarmReady
----bld is omitted
struct FullConfig
----feNumber is omitted
----scpuNumInPart is omitted.
- Steve
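To keep track of what the display GUI has to change, Steve's rename list can be written down as a small lookup table. The following is only a sketch in Python (not the Java of the ACP itself): the struct and field names come from the note above, while the dict-based record format and the convert_record helper are hypothetical, made up for illustration.

```python
# Per-struct changes, transcribed from the note above.
STRUCT_CHANGES = {
    "TeamStats": {
        "new_name": "SubfarmStats",
        "rename": {"teamId": "subfarmId", "bldReady": "subfarmReady"},
        "omit": {"bld"},
    },
    "PartStats": {
        "new_name": "PartStats",
        "rename": {"bldReady": "subfarmReady"},
        "omit": set(),
    },
    "SmStats": {
        "new_name": "ProxyStats",
        "rename": {"bldReady": "subfarmReady"},
        "omit": {"bld"},
    },
    "FullConfig": {
        "new_name": "FullConfig",
        "rename": {},
        "omit": {"feNumber", "scpuNumInPart"},
    },
}


def convert_record(struct_name, old):
    """Hypothetical helper: map an old-style record (a dict) to the new names."""
    change = STRUCT_CHANGES[struct_name]
    new = {}
    for field, value in old.items():
        if field in change["omit"]:
            continue  # field no longer exists in the new struct
        new[change["rename"].get(field, field)] = value
    if struct_name == "PartStats" and "partNum" in new:
        new["partNum"] -= 1  # old range 1..8 becomes 0..7
    return change["new_name"], new
```

The time fields (loading, waiting, etc.) are deliberately left alone here: the note says they will be in msec but does not say what the old units were.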
2005.06.13
The ACP for EVB2 is now working on eproxy01. Some functionality has been removed and will be put back in a form that matches the new system. In particular, the directories ace/sm/ and ace/scpu/ become ace/proxy/ and ace/scpu/ respectively, but have been emptied of all their classes. Those classes allowed the ACP to connect to the SM and SCPUs over CORBA and pass some messages back and forth; for EVB2 we will connect through the RTServer instead, so they will be rewritten as discussed below.
2005.06.13
It's official, or almost. The ACP is moving to the cdfevb2 package. I find this will make development much more natural. I'm spending the day removing all dependencies on cdfevb and recompiling. For two birds/one stone, I'm also making sure this can be done on eproxy01 so we can move it there.
2005.06.10
For examples of daqmsg classes:
on b0evb2gate: /home/khaldoun/cdf/cdfevb2/proxy/java/EvbList.java
on b0dap30: $FER_DIR/src/runControl/phys/EvbReadoutList.java
on b0dap30: $CDFEVB_PROXY_DIR/EvbProxy.java (for conn to RTServer)
2005.06.07
I've copied and pasted some code and comments at Markus' request to explain what messages should be passed between the ACP and the Sm/Scpu into this file: newmessages.txt.
2005.06.04
Documentation reading day:
Questions/Notes:
2005.06.03
The task at hand: write Java classes that extend daqmsg for communication between the ace control panel (?) and the scpu/proxy. The contents of those messages are outlined in the Sm.idl and Scpu.idl files in the cdfevb.sm and cdfevb.scpu packages respectively.
Discussing with Markus what I could use as an example class to guide my learning here...
Also, previous misunderstanding cleared up. The IDL files do not need to be touched. These would need to be changed if we were using idltojava, but these classes will be handwritten by yours truly, so this is not needed.
2005.06.02
Taking the day to learn a bit about Java IDL, modules, etc... so I can understand and modify the GUI widgets for various obsolete functions and the function calls themselves in the ace control panel. Useful refs so far:
2005.06.01
Starting work on getting rid of obsolete GUI widgets and function calls for the Ace Control Panel (in the SM and SCPU packages of cdfevb). Brief discussion with MK regarding whether I could make a new package in cdfevb2 called ace and copy this code there so as to make a break with the old Panel. After considering the extra work involved (e.g., there is no SM package in cdfevb2 since there is no SM, so this would require serious reworking), I see now that it's preferable to keep this tool under the cdfevb code. More on this tomorrow...
2005.05.23
Did some refinements of the EVB start/stop button. Need to speak to Steve about "Expert's Corner"... Owl shifts are a drag.
2005.05.17
Started working on EVB2 panel. Removed following buttons:
New code is implemented for Check EVB Proxy, which checks whether the EVB proxy is running on any of the 8 partitions. New code for START EVB proxy does the same check and then starts the proxies on all partitions.
2005.04.27
Brief convo with Markus last night. He mentioned one of the challenges ahead for us: in the new Event Builder, there can be one proxy per partition, instead of one single proxy. Actions such as "Clean up EVB" or "Stop Proxy" will have to be implemented differently in order to take this into account. Aside from that, this remains a long-term issue (weeks), and the update will not happen too soon.
2005.06.10
Worked on "Networks" note for EVB2. Markus wrote a first draft and I contributed some comments/corrections. The note is available here.
2005.06.01
Added network diagrams showing the various networks used by the new EVB: private Gigabit Ethernet (online), private Gigabit Ethernet (Test Stand), private Fast Ethernet Boot Network, and connections to the outside world (Fnal "Public" Online Network). These currently reside on b0pcmitxx, but will probably end up elsewhere when they're moved around to be put into this or that note/webpage.
2005.03.01
After about 3 weeks of familiarisation, a first draft of the new EVB documentation is available. It is incomplete at the moment, but was a good way to start learning about the EVB upgrade. So far, only the note for Aces has been drafted.
2005.06.15
From the elog:
Level3 was paged but in fact it was CSL that was giving trouble.
The level3 display was green (waiting to output) and consumer csl monitor was red (presumably not working).
After we paged CSL and they restarted it, things are back to normal.
2005.06.15
First week of pager starting yesterday. Got a page this morning while aces were running L2 torture: Relay failure. Here is the text from the elog:
Walked into CR as L3 problems were appearing. Relay Server troubles on converter 13. L3 proxy got stuck in trying to recover. Eventually, rebooted c13, cleaned up all L3 processes including relay servers, and back to running.
---khaldoun and arkadiy
2005.06.07
Seems that most issues on the list are ok for now. Ilya strongly recommends reading the L3 Expert Manual found here.
2005.06.04
List of recent L3 pager problems found here.
2005.05.26
Made major changes to the script. Kept the old one as qsub_fcdflnx_tar since it works just fine for a small tar file (<2GB). The new script is called qsub_fcdflnx_scp, and uses a second script, qsub_copy_over, to copy things back from the CAF. Yes, this is a horrible kludge, but after hours of fighting with csh and then with bash, I just couldn't resolve everything under one roof. So if the current solution works (testing now), I'd be in favor of using it regardless of elegance.
2005.05.25
Previous problem fixed (been meaning to ask what it was... will update when I know). Now we would like to have the script scp the release_dir to the caf instead of tarring it (2GB limit is getting in the way).
2005.04.30
Bruce seems to have found the reason for the problem he was having. From his email:
From your output it looks like I need a directory on fcdfdata051.fnal.gov:
Segment number from 1 to 1
Initial command: ./tmp_caf_20005.tcsh
Output file :
khaldoun@fcdfdata051.fnal.gov:/cdf/scratch/khaldoun/icaf/20005.$.tgz
Submitted : Wed Apr 27 23:17:48 2005
Ended : Wed Apr 27 23:20:10 2005
I seem to not have one:
So we need directories there for $USER == klute, gchouda, and knuteson.
2005.04.26
Latest attempt: success. Perhaps we have different versions of the script?
2005.04.20
Email from Bruce regarding discrepancy in success status between his running of the script and mine:
[...]
Now I execute
cd fcdflnx4:/cdf/scratch/knuteson/cvs/current/
./bin/qsub "ls > yo.txt"
I do not get a yo.txt file, but instead get an email such as the one
attached.
Does qsub "ls > yo.txt" work for you?
Thanks,
[...]
Still need to check this to verify that I can either reproduce his error, or explain why I do not get it...
2005.04.16
qsub_fcdflnx was written to automate the submission of jobs to the main CDF CAF from the fcdflnx nodes. The goal here is to recreate bit by bit the environment we have in our release directory, with data files in certain places, and executables in others, without worrying about the fact that this will run elsewhere. Recreating the directory tree is the main goal. The script was written in summer 2004 and brought to near-completion. Other things distracted me, so I am getting back to it now to see if we can finish it.
After dealing with several issues related to old CAF configuration files (grabbed the latest .cafrc files from Guillelmo), I am now testing the script. As far as I have seen, it fulfills the desired purpose. Specifically, it copies the directory structure of the release directory, sends it to the CAF, removes broken links and replaces them with directories, executes the command, then copies new files back to the release dir on fcdflnx. Emailed Bruce to ask what goes wrong when he executes this.
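The "remove broken links and replace with dirs" step is the least obvious one, so here is a minimal sketch of it. This is Python rather than the shell of the actual script, and the function name is made up for illustration; it only assumes a POSIX filesystem with symlinks.

```python
import os


def replace_broken_links_with_dirs(root):
    """Replace every dangling symlink under root with an empty directory.

    Returns the list of paths that were replaced.
    """
    replaced = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() is True for any symlink; exists() follows the link,
            # so it is False when the link target is missing.
            if os.path.islink(path) and not os.path.exists(path):
                os.remove(path)   # drop the dangling link
                os.mkdir(path)    # put an empty directory in its place
                replaced.append(path)
    return replaced
```

Links whose targets still exist are left untouched, so only the parts of the tree that did not survive the copy get replaced.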
2005.04.13
Problem: we want to submit jobs (i.e. a quaero analysis) without worrying about computing power available. So we want to use various computer farms where we have accounts. However, we'd like to avoid having to store files in different places, change configurations, environments...
Solution: we wrap the executable in a qsub_nodename script, where nodename is in this particular case fcdflnx. We use this script to automate all adjustments that must be made in order to submit the job on that platform.
2005.05.13
Done. Useful Java classes: JDialog, JPanel, JTextField... Basically, this involved a reworking of the code for logging into the Hardware Database, which was already available in the ACP. Next: final testing and commissioning. However, it would be nice to have this all in CVS first. Time to bug Markus.
2005.05.10
Ilya has asked me whether I could incorporate into the Ace Control Panel a button/option to kill a single L3 node. This is useful if there is a high rate of reformatter errors from one particular node, which has been happening rather often lately. Until now, the solution has been to pop up a window in the main RC that tells the Ace which node has been problematic and instructs him/her to page L3. We can bypass this page if we put this function in the Control Panel.
2005.05.05
The new control panel has gone in. Wrap up consists of: putting everything in CVS, adding 0/1 return value to Util.doCommand, and hoping it all goes well.
2005.05.04
The list below was taken care of point by point. It turns out that rgang is perfectly capable of doing the ssh into all ~250 nodes in parallel. The error I saw must have been something else. With a little bit of pain and help from Ron, rgang was installed on pcom2.
Finally, it's worth noting that the installation happened twice b/c of an interesting glitch: certain disks we thought were nfs-mounted (and thus accessible from both pcom1 and 2) turned out to be separate replicas on the two gateways. The code actually resides primarily on pcom1 and gets copied 1/week by a cron job to pcom2. So, to install software properly on pcom2, you have to make your changes on pcom1 so that they are not erased when the code is copied over.
2005.04.30
Test results:
l3_proxy_ops can run my script properly.
l3_proxy_ops with my version of the script in the level3 directory since it should not interfere with anything and should allow me to continue testing even during data-taking.
2005.04.30
Setting up the test (can be done before):
checkL3farm.sh, change from using test nodelist to actual latest nodelist (optional).
l3_proxy_ops being called.
During the test:
.cshrc file so that it points to my private version of the control panel.
.env.better script, so that $LEVEL3_DIR points to my private ~/level3/control/ directory and picks up my version of l3_proxy_ops (to include new code for $unk==--farm-check case).
2005.04.28
Second order pass:
if() then sequence of checking the farm for testing from $USER == khaldoun when performing tests on my private version of the Ace control panel.
l3_proxy_ops and running under $USER == evbproxy.
That pretty much takes care of integration into the java panel, and matches the action sequence delineated here.
Still unknown:
2005.04.26
The next task is to incorporate this more fully into the java panel. The series of actions we want:
l3_proxy_ops script checks farm first
The current trouble is that the checkL3Farm script only writes to stdout to tell the Ace if something is wrong. What we'd like is to pass some boolean to the Java code in order to test for this condition.
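One common way to get a boolean out of a script is its exit status rather than its stdout. The sketch below shows the idea in Python (the panel itself is Java, so this is only an illustration). It assumes the check script exits with 0 when the farm is fine and non-zero otherwise, which is a convention I would have to add to the script; run_check is a made-up name.

```python
import subprocess


def run_check(command):
    """Run a check command and return (ok, output).

    ok is True when the command exits with status 0 (assumed convention:
    0 means the farm is fine). The text output is kept so that it can
    still be shown to the Ace.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return result.returncode == 0, result.stdout
```

The caller then branches on ok and only uses the output for display, instead of parsing the text to decide what happened.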
2005.04.21
First order pass, now using rgang: Markus kills two birds with one stone, and shows me a great tool, written by Ron Rechenmacher. rgang is a Python program that has the ability to execute a script/command in parallel, branching at several levels, and even copying itself to nodes that do not have it installed. The cherry on top? It allows you to control timeout for the ssh/rsh.
It's obvious that this has enormous utility for us. With this knowledge, and a little bit of trial and error, I have now a first order pass at this script. It is linked here for archival purposes. It is a shell script that should run under l3proxy and that is called when you want to clean the farm (i.e. as part of l3_proxy_ops). This has been tested and shown to work.
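For the record, the idea behind the check (probe every node in parallel, with a bounded ssh timeout) can be sketched in a few lines of plain Python. This is not rgang and not the linked script, just a stand-in to illustrate the technique; the function names, the 5 second timeout and the worker count are all assumptions. BatchMode and ConnectTimeout are standard OpenSSH client options.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_probe(node, timeout_s=5):
    """Return True if `ssh node hostname` succeeds within the timeout."""
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes",
             "-o", "ConnectTimeout=%d" % timeout_s,
             node, "hostname"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,  # hard stop if ssh itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def unreachable_nodes(nodes, probe=ssh_probe, max_workers=32):
    """Probe all nodes in parallel and return those that failed."""
    nodes = list(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(probe, nodes))
    return [node for node, ok in zip(nodes, results) if not ok]
```

The probe is passed in as a parameter so the logic can be tried out with a fake probe, without touching the farm.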
2005.04.20
A thorough talk with Ilya has clarified many aspects of this project. The primary conclusion is that the zeroth order pass explained in previous entries lacks important functionality and goes through the wrong channels.
In terms of functionality, the primary missing piece is key to the success of this script. Due to the low frequency of problems when cleaning up L3, we must take execution time into careful consideration. Rephrasing Ilya's formulation, if we lose 40 minutes every time we get stuck cleaning up, and we get stuck only 1 out of 20 times that we clean up the computer farm, then the script does not have any advantage if it takes 2 minutes to run and it is run each time we do a clean up. The numbers used in this example are pretty representative, so let's set a goal of about 30 seconds for execution time. This requires that we be able to control the ssh timeout, something which is not easily doable in shell, but is effortless in expect. Nothing is for free though, since I have never used expect and would prefer not to learn a new language unless it is truly time-efficient to do so.
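Ilya's argument is just an expected-value calculation, spelled out here in Python. It assumes the check catches every case that would otherwise get stuck; the function name is made up.

```python
from fractions import Fraction


def net_minutes_saved_per_cleanup(loss_min, stuck_fraction, check_min):
    """Average minutes saved per cleanup by always running the check first.

    loss_min:       minutes lost each time the cleanup gets stuck
    stuck_fraction: fraction of cleanups that get stuck
    check_min:      minutes the check adds to every cleanup
    """
    return loss_min * stuck_fraction - check_min


stuck = Fraction(1, 20)

# A 2-minute check: 40 * 1/20 - 2 = 0, so no advantage at all.
slow = net_minutes_saved_per_cleanup(40, stuck, 2)

# A 30-second check: 40 * 1/20 - 1/2 = 3/2 minutes saved per cleanup.
fast = net_minutes_saved_per_cleanup(40, stuck, Fraction(1, 2))
```

So the check breaks even at 2 minutes of run time, and the 30 second goal leaves 1.5 minutes saved per cleanup on average.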
Second, Ilya pointed out that evbproxy should not have permission to log in as l3proxy, effectively killing the idea for a script of the kind I had described. The way these actions are currently performed (stopL3proxy and startL3proxy require similar infrastructure since only the l3proxy account can do this) is as follows: when the button is pressed, a string containing the command is passed through a specific (hard-coded) socket. A daemon configured in inetd.conf and running under l3proxy picks up the string and passes it to l3_proxy_ops, a shell script with cases for "stop", "start", "clean" that in turn can call $L3_CSHDIR/l3_all_nodes_exe, a long and complex expect script that can kill all nodes, check all nodes, etc...
Phew... if it sounds scary, it's because it is, with all due respect for the hard-working students who wrote these quite powerful scripts. The thing to do then -- it seems -- is to pass a new string, "farm-check", and modify l3_proxy_ops so that it calls my script when this string is passed.
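Stripped of the socket and inetd plumbing, the plan amounts to adding one more case to a string dispatch. A toy version in Python (the real thing is the l3_proxy_ops shell script; the handlers here are placeholders that only return a label):

```python
def make_dispatcher(handlers):
    """Return a function that maps a command string to its handler."""
    def dispatch(command):
        key = command.strip()
        if key not in handlers:
            raise ValueError("unknown command: %r" % key)
        return handlers[key]()
    return dispatch


# Placeholder handlers; the real cases live in the l3_proxy_ops shell script.
dispatch = make_dispatcher({
    "stop": lambda: "stopping L3 proxy",
    "start": lambda: "starting L3 proxy",
    "clean": lambda: "cleaning farm",
    "farm-check": lambda: "checking farm",  # the new case proposed above
})
```

The point of going through the existing channel is that nothing changes on the evbproxy side except the string that gets sent.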
2005.04.18
Quick note about the node list. The Level3 proxy updates a list regularly and puts it in $L3_LOGDIR/all_others/l3_nodes_list. The ASCII file contains several comments and text we don't need. Simply grep for 'b0l3' since all nodenames begin with that, and then use a simple while loop in the shell script to end up with a space-separated list of all nodes.
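The same filtering, sketched in Python instead of grep plus a while loop. The layout of the node list file is an assumption here: I take it that any line mentioning a node contains the name as a word starting with b0l3, and I keep the first such word per line. The sample text in the usage note is invented for illustration.

```python
import re

# A node name is taken to be a word that starts with "b0l3".
NODE_PATTERN = re.compile(r"\bb0l3\w*")


def node_list(text):
    """Return a space-separated list of node names found in text."""
    names = []
    for line in text.splitlines():
        match = NODE_PATTERN.search(line)
        if match:
            names.append(match.group(0))  # first b0l3... word on the line
    return " ".join(names)
```

For example, node_list("# comment\nb0l3001 up\nb0l3002 up\n") gives "b0l3001 b0l3002".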
2005.04.17
At zeroth pass, we just want to execute the following, then put it all in a nice script that will catch the output to look for errors, call that script from the java button, and call it a day:
[evbproxy@b0l3pcom1 ~]$ ssh b0l3001 hostname
[evbproxy@b0l3pcom1 ~]$ ssh b0l3002 hostname
[evbproxy@b0l3pcom1 ~]$ ...
There are several reasons why you cannot go home for margaritas yet. First, the evbproxy account, which runs the Ace Control Panel (ACP) is crippled so that the aces don't end up shooting themselves in the foot accidentally. As a consequence, the evbproxy acct cannot actually log into L3 processor nodes.
However, the evbproxy acct has the ability to log into the gateways pcom1 or pcom2 as l3proxy and thus gains the ability to execute any script it likes with ssh, including one where it logs into L3 proc nodes. (It has come to my attention that this is not supposed to be allowed; this solution was discarded for this reason.) Next logical thing to do: write the script, keep it in the L3 directory, and do something akin to the following:
[evbproxy@b0l3pcom1 ~]$ ssh -l l3proxy b0l3pcom1 login_script.sh
This has now been tested and shown to work. The next step is to figure out a nice way to grab the latest node list and iterate over that instead of hard coding the nodenames, since the online node list changes so often.
2005.04.15
The problem is the following: at a low but significant frequency, cleaning up the Level 3 farm causes L3 to get stuck, the expert to be paged, and circa 45 min of data to be lost. We consider this issue to be easily solvable: the cause of the problem is that some nodes are down (can ping, but not ssh), which makes cleanup impossible, but the Aces have no way of knowing this under the current scheme.
The task is then the following: we want to modify the L3/EVB Ace Control Panel (horrendously named, btw, but sadly the name stays) such that the Cleanup L3 button automatically causes a check of the farm which will give feedback to the Ace and alert him to potentially risky conditions. From there, we can give options such as (for instance, only for demonstration purposes):
2005.06.10
The talk given at New Perspectives was a shortened version of an older talk. The new pdf can be found here.
2005.06.05
Reworked old turbosim talk to get it down to 12 min. Current draft soon...
2005.06.01
Looks like Georgios inherited this job. Case closed for the moment.
2005.04.27
The new Level3 farm will be made up of nodes with two CPUs, each with HT enabled, which (reason not clear, but verified with Christoph) shows up as 5 CPUs when doing something like top. We'd like this to be visible in the L3 monitoring display, which runs in the control room at one of the DAQ Ace consoles. This involves a little bit of Java coding to monitor 5 CPUs instead of the current 2, and to reflect this monitoring with a little graphic.
In all likelihood, this project will not move forward until the L3 check farm implementation is finished.