HM 4.3 latest dev snapshot - Reliability?


 

Jerry Allen

I have had some issues with HM today, the first day of operation.
Hardware:
Known good RPi 3B+
Known good SD card
Known good RF: signal is great with 99% quality @ 72 Mbps, no dropped packets while pinging
Power supply is a regulated 12 V, 5 A unit. The scope sees little to no noise from the supply.

1. Serial error and no HM hardware detected. Restarting linkmeterd fixed it; it did not require a reboot or power cycle of the Pi. Nothing was logged that I can find.
2. While using PitDroid, everything was working and then, poof, it worked no more. I reset my phone and reinstalled PitDroid, still no go, but a browser pointed at the same IP/port would work. I rebooted the HM and PitDroid started working. All the while, an external process that I built for another grill controller was polling the new HM every 30 seconds (so I can get it integrated), and that process never missed a polling cycle.

I have remote logging turned on and there is nothing logged about the serial error or about this inability to connect.

Any suggestions? I will put this on a battery to eliminate power issues.....

Jerry
 
I'm not sure where to tell you to look to track this down. Since posting the new snapshot last week (November 20 2018 12:25:09 EST) I've had both a 3A+ and a B running. The 3A+ is pretty much identical hardware from a parts-in-use standpoint (wifi and CPU), right? Checking their graphs, I don't see any dropouts over the past 24 hours, although that doesn't say whether there have been any dropouts in the serial communication over the past week. They both still respond to web requests, but I don't have something polling them every 30 seconds.

I would say the first thing would be to separate the two issues. Don't use PitDroid; just use the webui and see if there's any issue with lost communication. I don't think there will be an issue though, because your external process was still pulling data, which means the HeaterMeter -> linkmeterd -> lmclient -> uhttpd path was fully operational. The web server should be able to support up to 20 simultaneous requests and 100 ongoing connections. You could check /luci/admin/status/processes in the webui when it stops working and see if you have 20 or more uhttpd processes going, which would indicate something is holding onto them for some reason.
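If eyeballing the process page gets tedious, the same check can be scripted. This is only a rough Python sketch: it assumes you have captured ps-style text from the device somehow, the sample text is made up for illustration, and the function names are hypothetical. The limit of 20 comes from the figure quoted above.

```python
def count_processes(ps_output: str, name: str = "uhttpd") -> int:
    """Count lines of ps-style output that run an executable called `name`."""
    count = 0
    for line in ps_output.splitlines():
        # Compare the basename of every field, so /usr/sbin/uhttpd matches "uhttpd".
        if any(field.rsplit("/", 1)[-1] == name for field in line.split()):
            count += 1
    return count


def looks_saturated(process_count: int, limit: int = 20) -> bool:
    """True when the count has reached the request limit quoted in the post."""
    return process_count >= limit


# Made-up sample text for illustration only, not captured from a real device.
SAMPLE_PS = """\
  PID USER       VSZ STAT COMMAND
  812 root      1580 S    /usr/sbin/uhttpd -f -h /www -p 0.0.0.0:80
  990 root      1580 S    /usr/sbin/uhttpd -f -h /www -p 0.0.0.0:80
 1024 root      3456 S    /usr/sbin/linkmeterd
"""

if __name__ == "__main__":
    n = count_processes(SAMPLE_PS)
    print(n, "uhttpd processes; saturated:", looks_saturated(n))
```

Matching on the basename rather than a substring keeps a command line that merely mentions uhttpd in an argument path from being counted by accident.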

If it still is acting up, then maybe you can provide information about your external polling process and I can set that test up here as well to see if I can reproduce it under similar conditions.
 
Hi Bryan,

I took the HM apart last night and re-soldered everything (not something I wish on anyone). I also placed some butyl rubber spacers to stabilize the 3 boards in the HM and put a small dab of hot glue on the connectors so there is no way the connections can move, then put it back in the case. The external poller has now run for 14 hours with no issue, without PitDroid running. I'll start PitDroid as well tonight, and if I can keep it running on my phone while the poller keeps running, I'll test for another 24 hours. There are 3-4 uhttpd processes, and there never seem to be more than 4 TCP sockets in TIME_WAIT and 6 in LISTEN on the HM, 1 of which is IPv6 and never gets used.......
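For anyone wanting to track those socket counts over time, here is a minimal Python sketch that tallies TCP states from text in the Linux /proc/net/tcp format. The state codes are the standard Linux ones; the sample rows are invented for illustration, and the function name is hypothetical. It assumes you copy the file contents off the device and run this elsewhere.

```python
from collections import Counter

# Hex state codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def tally_tcp_states(proc_net_tcp: str) -> Counter:
    """Count sockets per TCP state in /proc/net/tcp-formatted text."""
    counts = Counter()
    for line in proc_net_tcp.splitlines():
        fields = line.split()
        # Data rows start with a slot number such as "0:"; this skips the header.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


# Invented sample rows for illustration only.
SAMPLE_TCP = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0050 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1001
   1: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1002
   2: 5FFEFF0A:0050 0AFEFF0A:C350 06 00000000:00000000 03:00000A00 00000000     0        0 0
   3: 5FFEFF0A:0050 0AFEFF0A:C351 01 00000000:00000000 00:00000000 00000000     0        0 1003
"""

if __name__ == "__main__":
    for state, n in sorted(tally_tcp_states(SAMPLE_TCP).items()):
        print(state, n)
```

A steadily climbing count in one state across snapshots would be the thing to look for; a flat handful like the numbers above suggests sockets are not being leaked.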

Issues could have been a fluke or I could have had a component that was not fully soldered.

I had issues with the CyberQ running out of sockets/listeners, so I know what you are referring to. After 5 or 6 years the CyberQ is worn out: no LCD, probe connections no longer reliable, etc., and the cost of one is not something I want to endure again......

The poller is really quite simple: it opens a connection, gets the JSON data, automatically closes the connection, and parses the JSON. Eventually it will stuff this into my homegrown software, which has an area for notes/recipes/etc.

Here is the unfinished PHP code.....
<?php
$json_string = 'http://10.255.254.95/luci/lm/hmstatus';

$jsondata = file_get_contents($json_string);
$obj = json_decode($jsondata,true);

//date time
$tstamp = $obj['time'];
//var_dump($obj['time']);
//echo "$tstamp\n";exit;
//Grill Set Temperature
$set_temp = $obj['set'];

//Name
$p0_name = $obj['temps']['0']['n'];
//Temperature F
$p0_temp = $obj['temps']['0']['c'];
//Degrees Per Hour
$p0_dph = $obj['temps']['0']['dph'];
//Alarm Low
$p0_a_low = $obj['temps']['0']['a']['l'];
//Alarm High
$p0_a_high = $obj['temps']['0']['a']['h'];
//Alarm Ramp
$p0_a_ramp = $obj['temps']['0']['a']['r'];

//Name
$p1_name = $obj['temps']['1']['n'];
//Temperature F
$p1_temp = $obj['temps']['1']['c'];
//Degrees Per Hour
$p1_dph = $obj['temps']['1']['dph'];
//Alarm Low
$p1_a_low = $obj['temps']['1']['a']['l'];
//Alarm High
$p1_a_high = $obj['temps']['1']['a']['h'];
//Alarm Ramp
$p1_a_ramp = $obj['temps']['1']['a']['r'];

//Name
$p2_name = $obj['temps']['2']['n'];
//Temperature F
$p2_temp = $obj['temps']['2']['c'];
//Degrees Per Hour
$p2_dph = $obj['temps']['2']['dph'];
//Alarm Low
$p2_a_low = $obj['temps']['2']['a']['l'];
//Alarm High
$p2_a_high = $obj['temps']['2']['a']['h'];
//Alarm Ramp
$p2_a_ramp = $obj['temps']['2']['a']['r'];

//Name
$p3_name = $obj['temps']['3']['n'];
//Temperature F
$p3_temp = $obj['temps']['3']['c'];
//Degrees Per Hour
$p3_dph = $obj['temps']['3']['dph'];
//Alarm Low
$p3_a_low = $obj['temps']['3']['a']['l'];
//Alarm High
$p3_a_high = $obj['temps']['3']['a']['h'];
//Alarm Ramp
$p3_a_ramp = $obj['temps']['3']['a']['r'];

echo "T $tstamp n $p0_name temp $p0_temp spl $p0_a_low\n";
echo "T $tstamp n $p1_name temp $p1_temp spl $p1_a_low\n";
echo "T $tstamp n $p2_name temp $p2_temp spl $p2_a_low\n";
echo "T $tstamp n $p3_name temp $p3_temp spl $p3_a_low\n\n";

?>
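
For comparison, the same poll can be written with a loop over the probes instead of one block per probe. This is a minimal Python sketch, not a replacement for the script above: it only relies on the JSON keys that script already reads (time, temps, n, c, a, l), the sample payload is made up, and the function names are hypothetical. The explicit timeout is an addition so a stalled device cannot hang the poller.

```python
import json
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> dict:
    """GET the status URL and decode the JSON body, giving up after `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize(status: dict) -> list:
    """Build one summary line per probe, in the same layout as the PHP echo lines."""
    lines = []
    for probe in status.get("temps", []):
        alarms = probe.get("a") or {}
        lines.append(
            f"T {status.get('time')} n {probe.get('n')} "
            f"temp {probe.get('c')} spl {alarms.get('l')}"
        )
    return lines


# Made-up payload for illustration; a real one would come from
# fetch_status("http://<device-address>/luci/lm/hmstatus").
SAMPLE = json.loads(
    '{"time": 1543000000, "set": 225, "temps": ['
    '{"n": "Pit", "c": 224.8, "a": {"l": -40, "h": 250, "r": null}},'
    '{"n": "Food", "c": null, "a": {"l": -1, "h": -1, "r": null}}]}'
)

if __name__ == "__main__":
    for line in summarize(SAMPLE):
        print(line)
```

Looping over temps means the code does not care how many probes the device reports, and using .get() keeps a missing key from raising.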
 
Well, you may be onto something here. I have had my browser window open all morning streaming from one of my HeaterMeters. The connection was working fine, I ran that PHP code... and my browser connection dropped off around the same time. From there, I got the "No HeaterMeter detected" message on the configuration webui. linkmeterd was still running and recording data, but it stopped servicing requests from the UNIX domain socket that's used to communicate with it. I still showed two ESTABLISHED connections to my web browser (maybe Chrome keeping them open?) and nothing going to the machine I ran the PHP from. lmclient times out trying to make a request from the command line as well. Restarting linkmeterd fixes it, and no data was lost, but there's clearly something amiss here.

So yeah, that's certainly weird. These are all what I'd call normal accesses which usually work flawlessly for days on end, so I'm going to try to see if I can come up with a reproducible test scenario. I've got your polling script running every 30 seconds and it all seems to be working so far, but let's see for how long and if it is related to THIS snapshot specifically, or something else under the hood (which seems unlikely given how stable the system usually is).
 
The bad news is that it ran all day and all night with a browser window open continuously and the PHP script polling it every 30 seconds, and everything today is still fine. It might be a problem with the size of the message coming back over the UNIX socket. Maybe there's some breakpoint where, if the data is exactly X bytes long, the code has a bug that locks the service thread. I had done testing on all message sizes under, at, and over the buffer size and it worked as expected, but perhaps a bug has snuck in there over time. I'm still keeping my eye on it, and if you find any way to reliably make it hang, please let me know so I can use that to help track it down.
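
To make the "under, at, and over the buffer size" idea concrete, here is a small Python sketch of that kind of boundary test. It is not linkmeterd's actual protocol or code: the 1024-byte buffer, the 4-byte length prefix, and the function names are all assumptions for illustration, and it uses a local socket pair rather than the real service.

```python
import socket

BUF_SIZE = 1024  # hypothetical receive buffer size, not linkmeterd's real value


def recv_exact(sock: socket.socket, length: int, buf_size: int = BUF_SIZE) -> bytes:
    """Read exactly `length` bytes, never asking for more than `buf_size` at once."""
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = sock.recv(min(buf_size, remaining))
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def round_trip(size: int) -> bool:
    """Send a length-prefixed payload of `size` bytes and check it arrives intact."""
    sender, receiver = socket.socketpair()
    try:
        sender.settimeout(2.0)
        receiver.settimeout(2.0)  # a hang shows up as a timeout, not a stuck test
        payload = bytes(i % 251 for i in range(size))
        sender.sendall(size.to_bytes(4, "big") + payload)
        length = int.from_bytes(recv_exact(receiver, 4), "big")
        return recv_exact(receiver, length) == payload
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    for size in (0, 1, BUF_SIZE - 1, BUF_SIZE, BUF_SIZE + 1, 2 * BUF_SIZE):
        print(size, "ok" if round_trip(size) else "MISMATCH")
```

The sizes right around the buffer boundary are the interesting ones, since an off-by-one in a read loop tends to show up only when a message lands exactly on, or one byte past, the buffer length.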

Yeah the write API access is broken in the snapshots still. The LuCI people have the ability for third party controllers to roll their own auth mechanisms, but it isn't formalized so they change it all willy nilly a couple of times every dev cycle as they evolve the core sysauth mechanism. Right now it is broken because there's no way to override it without entirely re-implementing ALL of the authentication system, it is sort of sealed where it wasn't before. They've also removed the ability to authenticate without going to the login page first, which is a problem for API-based accesses. I was planning on taking another look at it once the development settled down, but then they re-merged LEDE with OpenWRT so I'd have to port the project back over to that and I just haven't had the time yet.

So long story short, I think all 3rd party write access is broken unless you write a complicated script to use the login form to get a proper auth token and key for a full session.
 
My HM has also run 24+ hours with 3 probes and a jumper wire in the pit with no issues. I only had 1 food probe and no pit jumper when it locked up, so I'll remove the jumper and 2 probes today.

Better logging would be nice; I find the logging to be severely lacking, unless I just cannot find the logs. It seems to me that if something locks up there would be some indication somewhere in a log. There is a missing kernel module causing "error: 'net.nf_conntrack_max' is an unknown key", but I doubt that is the issue; I would still like to remedy that one. I am going to try building HM from the git source, but I am very new to LEDE, so that learning curve may take a bit. I would like to have some things that have been stripped out of the HM implementation of LEDE.

I can work out the API access for my purposes in PHP using curl, I think, but that is secondary in my mind. If I ruin a $100 piece of meat over a lockup issue.............. argh!



Jerry
 
The "error: 'net.nf_conntrack_max' is an unknown key" message appears when you go to the OpenWRT Status page (the main one on the top left) and it tries to query sysctl for information relating to the WAN NAT status. The firmware isn't set up to be a router, so it complains that there isn't any information. So many people ask about this, but the only way to squelch it is to patch out parts of the standard LuCI webui, and I don't want to have to maintain that.

I don't know what good a log would do. I mean the only thing I can log is if there's a web request and that's not going to help anything. You know there's a web request that's not being serviced. If it is blocked in an I/O operation (to the UNIX socket) then it's not going to log that it is locked because it is locked. It didn't crash or do anything wrong, it just hung up I think. It does spit out some messages, like if it has to reset the database when the time skips forward on boot, or there's serial information that fails checksum, or multiple updates come in the same second. There's a few other items as well.
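
Since a process blocked in I/O cannot report its own hang, the detection has to come from outside with a deadline. Here is a minimal Python sketch of that idea, under the assumption that the caller supplies whatever check it likes (an HTTP poll, a command-line request); the function name and the return values are hypothetical.

```python
import threading
import time


def call_with_deadline(func, deadline_s: float):
    """Run func() in a worker thread and classify the outcome.

    Returns ("ok", value), ("error", exception), or ("hung", None) when
    func() has not finished within deadline_s seconds.
    """
    result = {}

    def runner():
        try:
            result["value"] = func()
        except Exception as exc:  # report the failure instead of losing it
            result["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(deadline_s)
    if worker.is_alive():
        # The call is still blocked; the observer can log it or restart something.
        return ("hung", None)
    if "error" in result:
        return ("error", result["error"])
    return ("ok", result["value"])


if __name__ == "__main__":
    print(call_with_deadline(lambda: "status fetched", 1.0))
    print(call_with_deadline(lambda: time.sleep(5), 0.2))  # simulates a blocked request
```

The worker is a daemon thread so a call that never returns does not keep the watchdog itself from exiting; the stuck call is abandoned rather than cancelled.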

The good news is that HeaterMeter doesn't care what linkmeterd is doing. It runs the grill control on the microcontroller and just broadcasts its status and listens for changes in settings (like the setpoint). It will keep chugging away controlling the grill even if you rip the Pi off of it completely (either before it is running or while it is running). That's the safety of the dual processor (well, CPU and uC) system, the mission critical systems keep running no matter how much the monitoring system craps out.
 
The REST API should be functional again in the latest snapshot from today. I don't know if PitDroid uses the API or the old login system though, so that app may or may not work.
 
It uses the old login system, as there is no place to put the API key....... No issue, the web interface works fine from a droid anyway.......
 

 
