after a long time.. the idea itself :)

Wed Apr 23 12:21:36 CEST 2003

so, i finally gathered enough courage.. ha ha, no, i finally gathered enough
will to bother writing this down. but since i am too lazy, and since i had
already written the whole thing in english, that is unfortunately how you get
it too :) of course i apologize for my english :))

for a long time i have had this idea to build a system that would use bayes to
analyze trojans and worms. now i have finally built some kind of alpha
version of this system, but i quickly found out that i have two problems:
(1) my knowledge of file structure is too weak to know how to make good
"logs" for bayes, and (2) my programming skills are too weak to build
an optimized bayes for this. so since i am not able to build it myself, i am
giving this idea and all the thoughts i have about it to the world,
hoping that someone will be as excited about it as i am and make it real
(it would be very nice if i could be kept informed about its progress,
to "see the baby grow" :)

some more details about it.
the basis of the idea is that if i put the right information into bayes and
tune the bayes analysis, then bayes should probably be able to pick
trojans and worms out from "good" programs.
if i simply put in the hex of the program, i doubt that i can look forward
to good results. but suppose i am able to build a better "log", which would
include something similar to the logs that regmon and filemon from
sysinternals ( http://www.sysinternals.com/ ) are able to produce, and maybe
add to this things like module dependencies, maybe even sys calls (filemon
and regmon are examples of this for the registry and files, but there is more,
like system, disk, net,..), maybe some debug info, maybe even binary strings
and maybe something more.. then i think this would be great information that
bayes could use to fish out the trojans and worms.
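
a minimal sketch of what building such a "log" could look like, in python. the
only concrete part is pulling printable strings out of the binary; the event
tuples (source, operation, target) are a made-up format for illustration and
are not the real regmon/filemon output. the names extract_strings and
build_log are hypothetical too.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]


def build_log(events, binary: bytes) -> list[str]:
    """Merge monitored events and binary strings into one list of log lines.

    events is an iterable of (source, operation, target) tuples. This shape
    is an assumption made for the sketch, not a real monitor's log format.
    """
    lines = [f"{source} {operation} {target}" for source, operation, target in events]
    lines += [f"str {text}" for text in extract_strings(binary)]
    return lines


if __name__ == "__main__":
    # Toy input: one invented registry event and a few bytes of "binary".
    sample_events = [("reg", "SetValue", "HKLM\\Software\\Example")]
    sample_binary = b"\x00\x01hello\x00ab\x00world!!\xff"
    for line in build_log(sample_events, sample_binary):
        print(line)
```

each log line then becomes one "document fragment" that bayes can be trained
on, with the short source prefix keeping registry, file and string data apart.
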
the second part of this system is to build a tuned bayes for
analyzing these logs. first we would probably need to change some basic
parameters like the "number of top words" (compared to spam bayes (popfile @
http://popfile.sourceforge.net/ ), which uses 15, this system would probably
need several more) and several similar small things. the second thing would
be to decide how to cut up the data for analysis (example: for reg keys, should
we use the whole reg string
(HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Run), or is it
better to use it as separate words (HKEY_LOCAL_MACHINE, SOFTWARE, Microsoft...)?).
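
to make the reg key question concrete, here is a small python sketch of the two
options from the text (whole string vs. separate words). the third function,
which emits every path prefix, is an extra option that is not part of the
original idea; it is included only as a possible middle ground. all function
names are hypothetical.

```python
KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Run"


def tokens_whole(key: str) -> list[str]:
    """Option 1: the whole registry path is a single token."""
    return [key]


def tokens_split(key: str) -> list[str]:
    """Option 2: every path component is its own token."""
    return [part for part in key.split("\\") if part]


def tokens_prefixes(key: str) -> list[str]:
    """Extra option (an assumption, not from the text): one token per path prefix."""
    parts = tokens_split(key)
    return ["\\".join(parts[:i]) for i in range(1, len(parts) + 1)]


if __name__ == "__main__":
    for tokenizer in (tokens_whole, tokens_split, tokens_prefixes):
        print(tokenizer.__name__, tokenizer(KEY))
```

the whole string is very specific but will rarely repeat between samples, while
separate words like "Microsoft" repeat everywhere and say little on their own,
so which one works better would have to be tested on real logs.
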
even though IMHO this system as it is at the moment (the good logs and
the tuned bayes) should work, i dont have much faith that it
would work as well as one would like such a system to work. so here is
the third thing (the next idea) that could IMHO make this system work great.
if i simply give bayes the challenge of telling the difference
between "good" and "bad" programs, then i think it will have a lot of trouble
giving results good enough for real world use.. but if, instead of simply
making a base for "good" and "bad", i split the "bad" into categories like
"worm" and "trojan" or even more, like "irc_worm", "i_worm",
"mass_mailer_worm",... then the filter that bayes would use for the analysis
should be much stronger (more specific) and should be able to give
good results in the real world. this can go as far as filtering
worm/trojan families (in this case i am almost 100% sure that if i show
bayes one or two samples of one worm family (for example klez), it would
be able to pick up all the others).
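
the many-categories idea can be sketched as a plain multinomial naive bayes
with add-one smoothing, in python. this is not popfile's code and not the alpha
version mentioned above. the class name, the toy tokens, and the way "top
words" are chosen here (keep the tokens whose likelihood differs most between
categories) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class CategoryBayes:
    """Multinomial naive Bayes over any number of categories (sketch)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # category -> token -> count
        self.log_counts = Counter()               # category -> number of training logs
        self.vocab = set()

    def train(self, category, tokens):
        tokens = list(tokens)
        self.token_counts[category].update(tokens)
        self.log_counts[category] += 1
        self.vocab.update(tokens)

    def _log_likelihood(self, category, token):
        counts = self.token_counts[category]
        total = sum(counts.values())
        # Add-one (Laplace) smoothing so unseen tokens do not zero out a score.
        return math.log((counts[token] + 1) / (total + len(self.vocab)))

    def classify(self, tokens, top_n=None):
        """Return (best_category, scores); scores are unnormalized log values."""
        categories = list(self.log_counts)
        # Ignore tokens never seen in training; drop duplicates, keep order.
        known = list(dict.fromkeys(t for t in tokens if t in self.vocab))
        if top_n is not None:
            # Assumed "top words" rule: keep the most discriminating tokens.
            def spread(token):
                values = [self._log_likelihood(c, token) for c in categories]
                return max(values) - min(values)
            known = sorted(known, key=spread, reverse=True)[:top_n]
        total_logs = sum(self.log_counts.values())
        scores = {}
        for category in categories:
            score = math.log(self.log_counts[category] / total_logs)
            score += sum(self._log_likelihood(category, t) for t in known)
            scores[category] = score
        return max(scores, key=scores.get), scores


if __name__ == "__main__":
    bayes = CategoryBayes()
    # Invented toy tokens, not taken from real samples.
    bayes.train("good", ["file:read", "reg:read", "file:write"])
    bayes.train("good", ["file:read", "net:http"])
    bayes.train("mass_mailer_worm", ["reg:run_key", "net:smtp", "file:read_addressbook"])
    bayes.train("irc_worm", ["reg:run_key", "net:irc", "file:copy_self"])
    label, _ = bayes.classify(["reg:run_key", "net:smtp", "file:read_addressbook"])
    print(label)
```

adding a new category (say a worm family) is then just another train() call
with a new name, which is what makes the "one or two samples per family" idea
cheap to try.
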

some more thoughts.
about the last sentence above (bayes being able to find 100% of the klez
family).. virus writers could of course use this to make the next "klez"
totally different from the others, so that our system would not be able to
classify it as a klez family member. but in this case (1) i would no
longer say that this is klez, (2) bayes would probably still be able to
identify it as a worm, and (3) even if it missed it.. it is much simpler to
click one button and reclassify the sample for bayes than to come up with
1000 different ways to write one worm. (a small side note: since we would look
only for trojans and worms, we probably dont need to analyze files bigger than
1.5 mb.)
one thing that we learned from spam is that there are several ways
to fight such a bayes system (like using 1000s of silly words,
or in a program/worm this would be dumb jumps, and so on). still, i believe
that this system would be a major step forward from today's heuristic analysis
and file extension filtering.

now about the use of such a system.. i think that this system would be
totally useless for real viruses (code that infects other "good" programs),
since the logs that we would analyze would mostly contain data from the
"good" host program. so my goal is to use this system mostly for trojans and
worms (and today they are the top "viruses"). so basically this would not be
the next generation of av software, but an add-on to today's technology for
the future of computer security.

now about the real usage of it. the big problem of this system is that we
(probably) need to run the virus to get the "log", so we need to do this in
a safe environment. it is no problem to set up a vm (virtual machine) on a
server (a mail server is a good example) where the attachments could get
analyzed, but it is another thing to do this on an end user machine. so my
plan was, first, to make an "in lab" testing system where the potential of
this system could be tested (actually i dont know if this system would work
at all :). in the second step (if it passes the first one) i would make a
remote one-file scan server similar to these:
http://www.kaspersky.com/remoteviruschk.html ,
http://www.dials.ru/english/www_av/ ,
http://www.rav.ro/scan/indexn.php ,...
but enhanced with our bayes system, so that the system could get some real
world experience and testing. if it passes the second stage, it could
probably be easily transferred to mail and www servers (or
similar). next it could of course also be used on end user machines (home
users). i think it would be great as a mail plug-in/filter.

phuu, lots of typing (for me :). i am sure i forgot something, and sorry for
my english :)

regards, saso

---------------------------------------------------------
saso badovinac [callto: +38631514625]