I am running a Python script that does some web scraping, which means it occasionally writes (appends) data to a text file. Sometimes the script just freezes, but no error message shows up in the shell. I would like to know whether this is more likely because the Raspbian system is unreliable or because my code has some hidden issue.
A very good rule of thumb is that the fault is always in your own code and not in the system. In particular, if the rest of the system is running (i.e. you can use another console), then it is almost certainly the fault of your program for hanging.
In the shell that is running your process, try pressing Ctrl+C to ask your process to stop, or Ctrl+\ to just quit it altogether. With Ctrl+C, you should get an error message that shows where your program was when you interrupted it. Let's assume your program is
x = 0
def bar():
    return x * 2
def foo():
    x = 11
    while bar() < 100:
        x += 1
foo()
(Can you spot the error already?) I'm now running it with python program.py, but it hangs and does not terminate. Pressing Ctrl+C yields:
^CTraceback (most recent call last):
  File "python.py", line 9, in <module>
    foo()
  File "python.py", line 6, in foo
    while bar() < 100:
  File "python.py", line 3, in bar
    return x * 2
KeyboardInterrupt
The ^C is a visual representation of Ctrl+C and can be ignored. The rest is a stack trace which shows what the program was doing when we interrupted it. For good effect, interrupt your program multiple times and compare the stack traces to see what it usually is doing when "hanging".
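In case you did not spot the error: the assignment x = 11 inside foo creates a new local variable, so the global x that bar reads stays 0 and bar() returns 0 forever. Below is a minimal sketch of one possible fix; the global declaration is my addition, and passing x as an argument would work just as well.

```python
x = 0

def bar():
    # Reads the module-level x.
    return x * 2

def foo():
    # Without this declaration, the assignments below would only
    # touch a local x, and bar() would keep returning 0.
    global x
    x = 11
    while bar() < 100:
        x += 1

foo()
print(x)  # the loop now terminates once x * 2 reaches 100
```

With this change the script finishes immediately instead of spinning in the while loop.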
Additionally, extend your program with more debugging output (possibly toggled by a new switch) so that you don't need to interrupt it in the first place. A good idea is to always output something before and after you connect to the network.
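A rough sketch of what such switchable debugging output could look like, using the standard logging and argparse modules. The --debug switch, the logger name "scraper" and the helper fetch_with_logging are made-up names for illustration; fetch stands for whatever function your scraper actually uses to download a page.

```python
import argparse
import logging

log = logging.getLogger("scraper")  # hypothetical logger name

def configure_logging(argv=None):
    """Enable debug output only when the (hypothetical) --debug switch is given."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true",
                        help="print a line before and after each network call")
    args = parser.parse_args(argv)
    logging.basicConfig(
        level=logging.DEBUG if args.debug else logging.WARNING,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    return args

def fetch_with_logging(url, fetch):
    """Call fetch(url) and log before and after, so a hang is easy to locate."""
    log.debug("requesting %s", url)
    data = fetch(url)
    log.debug("received %d bytes from %s", len(data), url)
    return data

# Demonstration with a stand-in fetch function (no real network access).
configure_logging(["--debug"])
fetch_with_logging("http://example.net/", lambda url: b"<html></html>")
```

If the last line in the output is a "requesting ..." message with no matching "received ..." line, the program is stuck in that network call.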
Since you are on a network, the other side may also just die silently, and your program may wait some time to confirm that that's what happened (as opposed to just a slow-down because of bad reception or high congestion in the network). You can call socket.setdefaulttimeout with a low value to make your program give up early (with a timeout error) instead of waiting for the other side to say something.
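To illustrate the effect of a default timeout, here is a small self-contained sketch. The silent peer is simulated by a local listening socket that never sends anything; the 0.5 second value and the helper read_or_timeout are arbitrary choices for this example, not recommendations.

```python
import socket

# Applies to every socket created from now on (value in seconds).
socket.setdefaulttimeout(0.5)

def read_or_timeout(conn, nbytes=1024):
    """Return received bytes, or None if the peer stayed silent too long."""
    try:
        return conn.recv(nbytes)
    except socket.timeout:
        return None

# Stand-in for a server that accepts connections but never answers.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
server.listen(1)

client = socket.create_connection(server.getsockname())
result = read_or_timeout(client)
print("timed out" if result is None else result)

client.close()
server.close()
```

Without the setdefaulttimeout call, the recv in this sketch would block indefinitely, which is exactly the kind of silent hang described above.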
You can also use various tools to aid debugging. For example, type htop (run sudo apt-get install -y htop once to install it if you haven't already; alternatively, top works too) to see how your program is progressing. Have a look at the CPU load (at the very top) and where your program is listed.
Say it looks like this:
Despite sorting by CPU (press F6 to sort), our program does not even show up here, and htop is the only program using much of the CPU anyway. This means that our program (if it's running) is stuck in a system call, i.e. has yielded control to the operating system. But since the operating system isn't using much CPU either (its CPU usage is shown in red), it looks like we're waiting for something!
On the other hand, the htop output may look like:
You'll notice that program.py features prominently, so the program itself is busy computing. The load is also not caused by our program stressing the operating system in any way, since the bar is basically all green!
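The two situations (waiting versus busy computing) can also be told apart from inside Python by comparing CPU time with wall-clock time. This is only a sketch to illustrate the idea; cpu_fraction, waiting and spinning are names invented for this example.

```python
import time

def cpu_fraction(func):
    """Run func and return CPU time used divided by wall-clock time elapsed."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall

def waiting():
    # Like a blocked network read: time passes, but almost no CPU is used.
    time.sleep(0.3)

def spinning():
    # Like an endless loop: the CPU is busy for the whole duration.
    deadline = time.perf_counter() + 0.3
    while time.perf_counter() < deadline:
        pass

print("waiting:  %.0f%% CPU" % (100 * cpu_fraction(waiting)))
print("spinning: %.0f%% CPU" % (100 * cpu_fraction(spinning)))
```

A fraction near zero corresponds to the first htop picture (the program is waiting), a fraction near one to the second (the program is computing).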
Then you may want to investigate the current state of the program a little more closely than just looking at aggregate values or killing it. There are numerous tools, but let's look at two:
The strace utility (again, install once with sudo apt-get install -y strace) can show what system calls a program makes. That works for every program, not only Python programs. Running it on our simple example program yields:
$ strace python program.py
execve("/usr/bin/python", ["python", "program.py"], [/* 47 vars */]) = 0
brk(0) = 0x218f000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbb72742000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
* snip about 1000 lines*
read(3, "x = 0\ndef bar():\n return x * "..., 4096) = 101
lseek(3, 101, SEEK_SET) = 101
brk(0x291d000) = 0x291d000
read(3, "", 4096) = 0
brk(0x2914000) = 0x2914000
close(3) = 0
munmap(0x7f7b9d294000, 4096) = 0
If you want, you can also run it as strace -o logfile python program.py to write the output into ./logfile so that you can examine it in another shell, for example with a text editor.
We're seeing here that Python makes a lot of system calls to even just start, but our program makes no system calls at all! We can tell because the last system calls are Python reading in the source code file. With a different program, the output may look like
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("93.184.216.34")}, 16) = 0
recvfrom(3,
The important part is the last line, which is unfinished, showing that the operating system is in control at the moment. What is the operating system doing? Well, it's working on a call to recvfrom, i.e. waiting for data. Since it's just waiting and not actually doing much, htop will show only negligible red bars. If there were a bug in the operating system, htop would show lots of red now.
Now, how do we know which system call came from which Python statement? For that, we need a Python debugger. Your IDE may have one integrated, but in a pinch, the built-in pdb works. Let's run it on our (new) program:
$ pdb program2.py
> /home/phihag/tmp/stackoverflow/program2.py(1)<module>()
-> import socket
(Pdb) next
> /home/phihag/tmp/stackoverflow/program2.py(3)<module>()
-> c = socket.create_connection(('example.net', 80))
(Pdb) n
> /home/phihag/tmp/stackoverflow/program2.py(4)<module>()
-> while True:
(Pdb) n
> /home/phihag/tmp/stackoverflow/program2.py(5)<module>()
-> print(c.recv(1024))
(Pdb) n
Use next and step (or n and s for short) to step through the program (for large programs, you most likely want continue and a breakpoint). Type help pdb to see the pdb help information. In this case, we see that the line the program currently hangs on is line 5 (print(c.recv(1024))).
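If stepping through a long-running scraper is impractical, the standard-library faulthandler module offers a way to see the current Python stack without stopping the program. The following is only a sketch: the helper names current_stack_text and enable_signal_dump, and the choice of SIGUSR1, are my own assumptions for illustration.

```python
import faulthandler
import signal
import tempfile

def current_stack_text():
    """Return the current Python stack as text, as faulthandler prints it."""
    # faulthandler writes to a file descriptor, so use a real temporary file.
    with tempfile.TemporaryFile() as f:
        faulthandler.dump_traceback(file=f)
        f.seek(0)
        return f.read().decode()

def enable_signal_dump(signum=signal.SIGUSR1):
    """Print the stack of all threads to stderr whenever signum arrives (Unix only)."""
    faulthandler.register(signum, all_threads=True)

print(current_stack_text())
```

After calling enable_signal_dump() at the start of your program, running kill -USR1 with the process ID from another shell should print where the program currently is, and the program keeps running afterwards.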
Now, if you don't understand why your program is hanging, the above debugging tools should give you plenty of information to create a minimal, complete, verifiable example.
Once you have confirmed that the minimal example hangs as well, feel free to ask a Stack Overflow question about it.