
Anomaly Detection So Easy Your Grandma Can Do It: No ML Degree Required (Splunk .conf 2024 Rehearsal)

  • 0:02 - 0:03
    Alright. Welcome to another L.A.M.E.
  • 0:03 - 0:05
    Creations video. This is going to be more
  • 0:05 - 0:08
    or less a revisit of the
  • 0:08 - 0:11
    splunk.conf address I gave this
  • 0:11 - 0:16
    previous June 2024 in Vegas. So, let's
  • 0:16 - 0:18
    get to it. It was entitled "Anomaly
  • 0:18 - 0:20
    Detection: So Easy Your Grandma Could Do
  • 0:20 - 0:22
    It--No ML Degree Required."
  • 0:22 - 0:23
    I'm not going to spend much
  • 0:23 - 0:25
    time introducing myself; if you don't know who I am,
  • 0:25 - 0:27
    I'm Troy Moore from Log Analysis Made
  • 0:27 - 0:31
    Easy. Here's my contact information.
  • 0:31 - 0:33
    Alright, what we're going to discuss in
  • 0:33 - 0:35
    this conference breakout session
  • 0:35 - 0:37
    are common baselines you might need. What
  • 0:37 - 0:39
    are some things you might need in your
  • 0:39 - 0:41
    company? How do you create those baselines?
  • 0:41 - 0:43
    Let’s give a demo. I'm not a "death by
  • 0:43 - 0:45
    PowerPoint" person, so we’ll go
  • 0:45 - 0:47
    into a live demo on this. What should you
  • 0:47 - 0:49
    do after seeing this presentation, and
  • 0:49 - 0:51
    some "gotchas" to baselining?
  • 0:51 - 0:54
    Let’s discuss what baselining
  • 0:54 - 0:57
    is. A baseline is the set of expected values or
  • 0:57 - 0:58
    conditions against which all
  • 0:58 - 1:00
    performance is compared. I’ve given
  • 1:00 - 1:02
    a definition for it. What does that look like in
  • 1:02 - 1:05
    practice? Let’s go the other way. Common
  • 1:05 - 1:07
    baselines--let’s discuss what some are.
  • 1:07 - 1:09
    Maybe a hardware baseline. Can I get a
  • 1:09 - 1:11
    software baseline? Can I get network
  • 1:11 - 1:13
    ports and protocol baselines? User
  • 1:13 - 1:16
    baselines? User behavior baselines?
  • 1:16 - 1:18
    I know that I’ve been working
  • 1:18 - 1:20
    in the cyber world for many, many
  • 1:20 - 1:23
    years, and I will ask, as an auditor, "Do
  • 1:23 - 1:26
    you by chance have an inventory?" And you
  • 1:26 - 1:29
    know what? It's a funny answer. Ask
  • 1:29 - 1:31
    yourself, does your company
  • 1:31 - 1:33
    have a network inventory? How
  • 1:33 - 1:34
    thorough is it? How accurate is that
  • 1:34 - 1:37
    network inventory? Well, if you don’t have
  • 1:37 - 1:39
    that, can you give me a baseline of what's
  • 1:39 - 1:41
    on your network? A lot of people will
  • 1:41 - 1:42
    tell you it's really difficult to give a
  • 1:42 - 1:44
    baseline if you don't have a network
  • 1:44 - 1:47
    inventory. And so, you ask these
  • 1:47 - 1:49
    questions and often don’t get the
  • 1:49 - 1:51
    answers. I know I’ve told auditors year
  • 1:51 - 1:53
    after year at places I’ve worked, "I
  • 1:53 - 1:55
    don't have an inventory. I don't have
  • 1:55 - 1:56
    those kinds of things."
  • 1:56 - 1:58
    What we’re going to do in this
  • 1:58 - 1:59
    presentation is show how
  • 1:59 - 2:01
    Splunk and statistical models can make you
  • 2:01 - 2:03
    that hero. You can be the person who
  • 2:03 - 2:05
    provides that inventory. You can be the
  • 2:05 - 2:08
    person who provides
  • 2:08 - 2:12
    baselines and can show what is normal in your environment.
  • 2:13 - 2:15
    And what I'm going to do is,
  • 2:15 - 2:16
    in order to know what’s normal, we need
  • 2:16 - 2:19
    to look at the past. Hopefully,
  • 2:19 - 2:21
    this makes sense. If you don’t know
  • 2:21 - 2:22
    what happened in the past, you won’t
  • 2:22 - 2:25
    be able to know what’s normal. The
  • 2:25 - 2:28
    past is what defines normalcy.
  • 2:28 - 2:30
    So what we can do is look at
  • 2:30 - 2:32
    historical IP logs to see the connections,
  • 2:32 - 2:34
    and that can help us build a baseline. We
  • 2:34 - 2:36
    can track the processes that have been
  • 2:36 - 2:38
    running, and we can build a baseline. We
  • 2:38 - 2:40
    can look at the ports used by systems
  • 2:40 - 2:42
    historically, and we can build a baseline.
  • 2:42 - 2:45
    We can track historical login events, and
  • 2:45 - 2:48
    we can build a login event baseline.
  • 2:48 - 2:51
    Splunk is a logging system. So, if you've
  • 2:51 - 2:53
    been getting those logs,
  • 2:53 - 2:57
    then you have a tool now to build a baseline.
  • 2:57 - 3:00
    Here is the concept that
  • 3:00 - 3:01
    I'll need you to understand
  • 3:01 - 3:03
    to be able to grasp everything else. There are
  • 3:03 - 3:05
    two methods for baselining:
  • 3:05 - 3:08
    there's what I call the rolling window and the allow listing.
  • 3:08 - 3:10
    The rolling window is the
  • 3:10 - 3:13
    easiest way to start a baseline. In
  • 3:13 - 3:15
    my opinion, it’s the simplest method.
  • 3:15 - 3:17
    The concept is we’re going to use this
  • 3:17 - 3:21
    little bar here. We’ve got an x-axis. We’re
  • 3:21 - 3:23
    going to say this is a full line, and
  • 3:23 - 3:25
    this is a timeline. So, this might be one
  • 3:25 - 3:28
    day, a week, a month, a
  • 3:28 - 3:31
    year, or maybe three months.
  • 3:31 - 3:34
    This is a historical part of
  • 3:34 - 3:36
    that time. Let’s say it’s one day.
  • 3:36 - 3:38
    This could be 23 hours. This could be a
  • 3:38 - 3:40
    week; it could be 6 days. This
  • 3:40 - 3:42
    could be a month; it could be 29 days.
  • 3:42 - 3:44
    This could be a year; it could be 11
  • 3:44 - 3:47
    months. Then Y is a
  • 3:47 - 3:50
    portion of that time that we’re going to
  • 3:50 - 3:52
    look at. The X portion will be our
  • 3:52 - 3:55
    baseline, the historical events.
  • 3:55 - 3:56
    Y is what we’re going to look at.
  • 3:56 - 3:58
    We’re going to say, "Hey, looking at all
  • 3:58 - 4:00
    these events that have occurred, are
  • 4:00 - 4:02
    there any events in y that weren’t
  • 4:02 - 4:05
    in this baseline, that weren’t in x?"
  • 4:05 - 4:06
    If we do that, that’s the definition of
  • 4:06 - 4:09
    an anomaly. Anomalies are things that are
  • 4:09 - 4:11
    not in your baseline. So, we can use a
  • 4:11 - 4:13
    rolling window, and I’ll actually demo
  • 4:13 - 4:15
    how to do that. The other method is allow
  • 4:15 - 4:18
    listing, in which case we build a list of
  • 4:18 - 4:20
    our baseline. Again, you’ve got to figure
  • 4:20 - 4:20
    out how you’re going to build that
  • 4:20 - 4:23
    baseline. But if you do that, you put that
  • 4:23 - 4:25
    list into a lookup file. We can do that
  • 4:25 - 4:26
    by using the outputlookup command, and
  • 4:26 - 4:29
    then we use the lookup command in our logs.
  • 4:29 - 4:31
    We look at all the logs
  • 4:31 - 4:33
    coming in and say, "Do any of these logs
  • 4:33 - 4:36
    not have a matching pair
  • 4:36 - 4:37
    in our lookup?" If so, that would be an
  • 4:37 - 4:41
    indication that this is a new anomalous event.
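    As a rough sketch, those two patterns look something like this in SPL; the index, sourcetype, and field names are placeholders you would swap for your own data. Rolling window:

      index=my_index sourcetype=my_sourcetype earliest=-90d
      | stats min(_time) as earliest_time by key_field_1 key_field_2
      | where earliest_time > now() - 86400

    Allow listing, where you first build the baseline lookup and then check incoming logs against it:

      index=my_index sourcetype=my_sourcetype earliest=-90d
      | stats count by key_field_1 key_field_2
      | outputlookup my_baseline.csv

      index=my_index sourcetype=my_sourcetype earliest=-1h
      | stats count by key_field_1 key_field_2
      | lookup my_baseline.csv key_field_1 key_field_2 OUTPUT count as matched
      | where isnull(matched)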
  • 4:41 - 4:45
    Alright. I’ve given the
  • 4:45 - 4:48
    PowerPoint presentation on that. We’re
  • 4:48 - 4:51
    now going to go into demo time. We want
  • 4:51 - 4:53
    to demo this and show how this works in
  • 4:53 - 4:56
    actual practice. Again, the queries are at
  • 4:56 - 4:58
    the end of the presentation. I’m going to
  • 4:58 - 5:00
    have to give a link to this PDF
  • 5:00 - 5:02
    so you can grab them if you want to
  • 5:02 - 5:04
    use them, or just slow down the video.
  • 5:04 - 5:05
    I’m sorry, I’ll have to record it
  • 5:05 - 5:08
    that way. But anyway, that’s what
  • 5:08 - 5:09
    we’re going to do.
  • 5:09 - 5:12
    Alright. For this demo, I wanted to
  • 5:12 - 5:14
    make sure that any of you could go home
  • 5:14 - 5:17
    and use this very same thing, the same
  • 5:17 - 5:19
    dataset. So, I went and grabbed
  • 5:19 - 5:21
    Splunk’s freely available Boss of the SOC,
  • 5:21 - 5:25
    referred to as BOTS v3. You could grab
  • 5:25 - 5:27
    v2 or v3, but these
  • 5:27 - 5:28
    things will work with your own
  • 5:28 - 5:31
    data. However, I wanted to make sure that you
  • 5:31 - 5:32
    could do these very same scenarios when
  • 5:32 - 5:34
    you went back home after this
  • 5:34 - 5:38
    conference. So, I went to Google, typed in
  • 5:38 - 5:40
    "BOTS v3," and added "GitHub," and that
  • 5:40 - 5:44
    brought me to this little link here.
  • 5:44 - 5:46
    It’s just an app you can go
  • 5:46 - 5:49
    download. It’s a relatively large app
  • 5:49 - 5:51
    because it contains all the data in an
  • 5:51 - 5:53
    index already pre-indexed for you. So,
  • 5:53 - 5:55
    when you put this in, you’ll have
  • 5:55 - 5:57
    all the exact same data that I’m using
  • 5:57 - 5:59
    in these queries, allowing you to
  • 5:59 - 6:02
    easily run the exact same
  • 6:02 - 6:04
    thing in your own environment. As you
  • 6:04 - 6:05
    learn these queries, you’ll be able to
  • 6:05 - 6:07
    use them elsewhere. All the
  • 6:07 - 6:09
    documentation is right here if you want
  • 6:09 - 6:12
    to use any of these source types--
  • 6:12 - 6:14
    they’re available.
  • 6:14 - 6:15
    It also lists any of the required
  • 6:15 - 6:17
    software needed to run any of the
  • 6:17 - 6:19
    TAs in order to get your data
  • 6:19 - 6:22
    to parse correctly. We’re going to
  • 6:22 - 6:25
    primarily use the stream data and some
  • 6:25 - 6:27
    network host logs.
  • 6:27 - 6:30
    I’m going to jump over to my
  • 6:30 - 6:33
    environment. If I do a head 100 on this
  • 6:33 - 6:36
    little command here, the BOTS v3 stream
  • 6:36 - 6:40
    TCP, this is TCP network traffic, and I
  • 6:40 - 6:42
    can see the network traffic. I can see
  • 6:42 - 6:45
    certs going through. I can see
  • 6:45 - 6:47
    connections with bytes in and bytes out,
  • 6:47 - 6:51
    destination IP, source IP, destination port,
  • 6:51 - 6:52
    source port. And what I’m going to want
  • 6:52 - 6:55
    to do is baseline what the
  • 6:55 - 6:57
    normal IP traffic on my network is.
  • 6:57 - 7:00
    Then, when I see abnormal IP traffic, I
  • 7:00 - 7:03
    want to be alerted about it. And this has
  • 7:03 - 7:05
    varying levels of success based on how
  • 7:05 - 7:08
    random your traffic is--that is, how many new machines your
  • 7:08 - 7:11
    systems go out and visit. Workstations
  • 7:11 - 7:13
    browsing the Internet are going to
  • 7:13 - 7:15
    have a lot of new IP addresses on a
  • 7:15 - 7:17
    daily basis. Servers are probably
  • 7:17 - 7:20
    not going to go out and talk to a whole
  • 7:20 - 7:23
    lot of different devices. Specialized
  • 7:23 - 7:25
    devices, such as OT (Operational Technology)
  • 7:25 - 7:28
    devices, they won’t talk to a lot of
  • 7:28 - 7:31
    machines. Their communication is
  • 7:31 - 7:34
    pretty standard. So, we can actually use
  • 7:34 - 7:37
    that to understand what’s going on.
  • 7:37 - 7:39
    If I come in here, let’s run that
  • 7:39 - 7:41
    query. The concept is I’m using an
  • 7:41 - 7:43
    all-time query, but that’s because I’m
  • 7:43 - 7:46
    using this Bot v3 data. I have it in the
  • 7:46 - 7:49
    notes in my PowerPoint on how
  • 7:49 - 7:51
    to turn this into a 90-day rolling window.
  • 7:51 - 7:54
    But to make this work on the BOTS data, I
  • 7:54 - 7:58
    had to actually set my time and do things
  • 7:58 - 8:00
    a little bit differently. I’ll
  • 8:00 - 8:02
    show you how that looks when I’m done.
  • 8:02 - 8:03
    What we’re gonna do is: index equals
  • 8:03 - 8:07
    botsv3, sourcetype equals stream:tcp. And
  • 8:07 - 8:08
    what we’re gonna do here, this is the
  • 8:08 - 8:10
    magic: we’re just gonna use a stats
  • 8:10 - 8:12
    command. If I just did stats count by
  • 8:12 - 8:14
    source IP, destination IP, that would give
  • 8:14 - 8:17
    me every tuple that I’ve seen during
  • 8:17 - 8:21
    this window. But if I put stats min(_time),
  • 8:21 - 8:22
    it’s gonna give me the earliest
  • 8:22 - 8:25
    time, the smallest time value that it has
  • 8:25 - 8:28
    seen in this tuple. And so, this is giving
  • 8:28 - 8:30
    me the earliest time this has popped
  • 8:30 - 8:33
    up. So, I’ve got a 90-day rolling window
  • 8:33 - 8:35
    of tuples, and I will tag it with the
  • 8:35 - 8:39
    earliest value seen. If I do that
  • 8:39 - 8:41
    just like this,
  • 8:45 - 8:48
    we’ll see the earliest time come back.
  • 8:49 - 8:51
    If I undo that, now I’m going to
  • 8:51 - 8:52
    come in here. I’m gonna change it. Now,
  • 8:52 - 8:55
    I want to set a time. I want to know
  • 8:55 - 8:58
    anytime that this earliest time is
  • 8:58 - 9:00
    greater. So, in a normal scenario, I might go
  • 9:00 - 9:02
    back 86400. That’s the number of seconds
  • 9:02 - 9:04
    in a day. So, I might be looking for any
  • 9:04 - 9:07
    new tuples in a day. I had to use this
  • 9:07 - 9:10
    value here to move it to a new day based
  • 9:10 - 9:12
    on this Bot v3 data. There’s only two
  • 9:12 - 9:14
    and a half days’ worth of data in this
  • 9:14 - 9:16
    BOTS dataset. So, to make it work,
  • 9:16 - 9:18
    I had to put this
  • 9:18 - 9:19
    specific timestamp in. Normally, you
  • 9:19 - 9:23
    would be using something like now - 86400,
  • 9:23 - 9:25
    and I’ll show that. But we’re going
  • 9:25 - 9:26
    to come down here. We’ll go where
  • 9:26 - 9:29
    earliest time is greater than now_time. So,
  • 9:29 - 9:32
    if this first time it’s been seen is
  • 9:32 - 9:34
    greater than this, we’re gonna
  • 9:34 - 9:37
    get the values back. If it’s not,
  • 9:37 - 9:39
    this wouldn't show up. If I do it
  • 9:39 - 9:42
    with a one-day cutoff,
  • 9:42 - 9:43
    it’ll only show me any
  • 9:43 - 9:46
    new tuples that I’ve never seen in 90
  • 9:46 - 9:48
    days that have shown up today. So, if I
  • 9:48 - 9:51
    run this, we’re gonna flip this to
  • 9:51 - 9:52
    fast mode.
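    A sketch of that search, assuming the BOTSv3 app's index is named botsv3 and the stream:tcp events carry src_ip and dest_ip fields:

      index=botsv3 sourcetype="stream:tcp"
      | stats min(_time) as earliest_time by src_ip dest_ip
      | eval now_time = now() - 86400
      | where earliest_time > now_time

    On the static BOTSv3 data you would swap now() - 86400 for a fixed epoch that falls inside the dataset's two-and-a-half-day range; on live data, pair the now() - 86400 cutoff with a 90-day search window.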
  • 9:52 - 9:54
    This will come back with all the
  • 9:54 - 9:57
    tuples, the brand-new tuples it has never
  • 9:57 - 9:59
    seen before. I’m gonna tell you this is still
  • 9:59 - 10:01
    too large of a list, but part of
  • 10:01 - 10:04
    this list would normally drop down. The
  • 10:04 - 10:06
    fact is, the bigger the window you make,
  • 10:06 - 10:08
    the fewer values you’ll have. If I’m
  • 10:08 - 10:10
    looking at one day and I’m looking at
  • 10:10 - 10:13
    the new values, you’re gonna have more results. If I
  • 10:13 - 10:16
    go 90 days, the number of new tuples will
  • 10:16 - 10:19
    shrink. The bigger the window you have
  • 10:19 - 10:22
    over here, the smaller the number of
  • 10:22 - 10:24
    results that will come back, because more of
  • 10:24 - 10:26
    the machines I only visit every now and then will
  • 10:26 - 10:30
    already be included in my list.
  • 10:30 - 10:31
    Alright, this works. Let’s grab
  • 10:31 - 10:34
    something even a little easier to grasp.
  • 10:34 - 10:37
    Now, I’m showing this. These are processes.
  • 10:37 - 10:38
    When I look at the processes, I’m looking
  • 10:38 - 10:41
    at processes firing off. This is
  • 10:41 - 10:43
    calculator being run, application frame
  • 10:43 - 10:45
    host, crash plan desktop. These are
  • 10:45 - 10:48
    processes on a machine. I want to know if
  • 10:48 - 10:51
    there are new processes that have fired
  • 10:51 - 10:54
    off. We’re going back to the exact same
  • 10:54 - 10:56
    query. We’re gonna, this time,
  • 10:56 - 10:59
    group by instance, which is like this--
  • 10:59 - 11:00
    sorry, instance is the process name,
  • 11:00 - 11:02
    like application frame host,
  • 11:02 - 11:04
    and then the host here. We're gonna look at
  • 11:04 - 11:06
    instance and host, and we’re gonna
  • 11:06 - 11:09
    again grab the earliest time it was
  • 11:09 - 11:11
    seen. And we're gonna do an eval of the cutoff time
  • 11:09 - 11:11
    and a where earliest is later than it, and then we're gonna
  • 11:12 - 11:15
    run it. So, we’re basically saying, “Hey, did
  • 11:15 - 11:17
    I see this value?
  • 11:17 - 11:20
    What are the newest instances
  • 11:20 - 11:22
    that have fired up in the last 24 hours
  • 11:22 - 11:26
    that I have not seen over my time period?”
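    A sketch of that process search, assuming the Windows process logs are BOTSv3's PerfmonMk:Process events, where instance holds the process name (the same note about the static BOTSv3 timestamps applies):

      index=botsv3 sourcetype="PerfmonMk:Process"
      | stats min(_time) as earliest_time by instance host
      | eval now_time = now() - 86400
      | where earliest_time > now_time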
  • 11:26 - 11:27
    I run that,
  • 11:28 - 11:32
    and you can see that new software
  • 11:32 - 11:34
    processes are going to show up a lot
  • 11:34 - 11:36
    less frequently on your system. And so,
  • 11:36 - 11:37
    we can run that,
  • 11:37 - 11:41
    and we get back the new processes
  • 11:41 - 11:45
    that ran on this machine in the last 24
  • 11:45 - 11:47
    hours. And you can see
  • 11:47 - 11:50
    what happens is SCP and SSH, those
  • 11:50 - 11:52
    are brand new processes. If I was doing
  • 11:52 - 11:55
    an investigation, and all of a sudden machines
  • 11:55 - 11:56
    that have never used them start
  • 11:56 - 11:59
    running SCP and SSH, that’s probably
  • 11:59 - 12:01
    something I want to be looking at.
  • 12:01 - 12:03
    And so, baselining and knowing what your
  • 12:03 - 12:05
    systems run, and then when new processes
  • 12:05 - 12:07
    fire, we can look at them and say, “Do I
  • 12:07 - 12:08
    want to look at this?” We can build alerts
  • 12:08 - 12:09
    off of it.
  • 12:09 - 12:12
    Let’s jump to another example. This
  • 12:12 - 12:14
    time, listening ports. The amount of ports
  • 12:14 - 12:16
    that your machine is listening on should
  • 12:16 - 12:18
    be very static. It’s not going to change
  • 12:18 - 12:20
    a ton. But if someone’s opened up new
  • 12:20 - 12:22
    applications, they might be opening up
  • 12:22 - 12:24
    new listening ports. So, you want to look
  • 12:24 - 12:26
    at that. We can see here kind of the data
  • 12:26 - 12:30
    coming back. We can see which machine,
  • 12:30 - 12:32
    what ports are being opened, and what they’re
  • 12:32 - 12:36
    listening on. If we use this very same concept
  • 12:40 - 12:44
    we can see min(_time) as earliest_time.
  • 12:44 - 12:46
    This time, we're looking at host and
  • 12:46 - 12:48
    dest_port. That’s my pairing. That’s what
  • 12:48 - 12:51
    I’m looking for anomalies in. Grab
  • 12:51 - 12:53
    the earliest time seen.
  • 12:53 - 12:55
    Grab the window that I want to
  • 12:55 - 12:57
    track, and we’re going to say where earliest
  • 12:57 - 12:59
    time is greater than now time. And in
  • 12:59 - 13:02
    this one, make sure I flip it to verbose
  • 13:02 - 13:04
    because ports are really static.
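    A sketch of the listening-ports version; the sourcetype here is only a guess, so substitute whatever input feeds your open-port data, as long as it gives you a host and a dest_port field:

      index=botsv3 sourcetype="Script:ListeningPorts"
      | stats min(_time) as earliest_time by host dest_port
      | eval now_time = now() - 86400
      | where earliest_time > now_time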
  • 13:05 - 13:08
    What a surprise--I’m going to get
  • 13:08 - 13:10
    zero results back. And that’s actually
  • 13:10 - 13:12
    what I’m looking for. That’ll work out
  • 13:12 - 13:14
    really well for me. So, I’ve shown three
  • 13:14 - 13:16
    examples here of how you can just grab
  • 13:16 - 13:18
    any form of data. You look for what you
  • 13:18 - 13:22
    want to find, group it by what’s normal,
  • 13:22 - 13:25
    grab a big window, and then
  • 13:25 - 13:27
    set your time to say anything that’s
  • 13:27 - 13:30
    occurred--any new tuple that I see
  • 13:30 - 13:33
    since this time.
  • 13:33 - 13:36
    So, we jump over here. We can quickly
  • 13:36 - 13:38
    see this is how it looked in real-time.
  • 13:38 - 13:41
    This is how we do it at my place.
  • 13:41 - 13:47
    Now - 86400, last 90 days. We do the
  • 13:49 - 13:52
    now - 86400. This says, “Give me a
  • 13:52 - 13:56
    90-day window, go back one day.”
  • 13:56 - 13:59
    Very simple. We don’t change much, and
  • 13:59 - 14:01
    we just have that
  • 14:01 - 14:03
    working. Now, if we come over here, we can
  • 14:03 - 14:07
    do the exact same thing with our Splunk
  • 14:07 - 14:10
    searches. We can come over here, and we can
  • 14:10 - 14:13
    take this to the next level in another way.
  • 14:13 - 14:15
    Instead of saying, “I want a 90-day window,”
  • 14:15 - 14:17
    there’s a problem with the 90-day window.
  • 14:17 - 14:19
    As soon as this anomalous event occurs,
  • 14:19 - 14:23
    it’s not going to be anomalous tomorrow.
  • 14:23 - 14:24
    So, what we can do is we can actually
  • 14:24 - 14:27
    build a lookup and say, “I’m going to take
  • 14:24 - 14:27
    everything.” I’m gonna use the same
  • 14:27 - 14:29
    concept to pull them together. This time, I
  • 14:32 - 14:34
    don’t need a time. I’m just going to grab
  • 14:34 - 14:37
    all of my tuples, and I’m going to output
  • 14:37 - 14:40
    lookup into a CSV, or I could do a KV
  • 14:40 - 14:44
    store, and that becomes my window,
  • 14:44 - 14:48
    because new anomalous events will not
  • 14:48 - 14:51
    get added to it unless I rerun this output
  • 14:51 - 14:53
    lookup. So, when I run this, I can do
  • 14:53 - 14:56
    this: I build my baseline, and
  • 14:56 - 14:58
    then I’d have a scheduled search
  • 14:58 - 15:00
    that runs over the recent data, and
  • 15:00 - 15:02
    I’m going to do a stats count against it.
  • 15:02 - 15:03
    I'm going to do a lookup, going to match
  • 15:03 - 15:06
    on source IP and dest IP, and output
  • 15:06 - 15:08
    count, say, as matched. And I’m
  • 15:08 - 15:10
    going to do where isnull(matched),
  • 15:10 - 15:12
    meaning I’ve got a source IP and
  • 15:12 - 15:14
    destination IP, but those are not in this
  • 15:14 - 15:17
    lookup. That would make it null, and this
  • 15:17 - 15:19
    will alert me. And if tomorrow the same
  • 15:19 - 15:21
    source IP and destination IP appears, it
  • 15:21 - 15:23
    will also alert me because as long as
  • 15:23 - 15:28
    I don’t update this lookup table, it will
  • 15:28 - 15:30
    always be anomalous.
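    A sketch of that allow-list version for the IP tuples, with known_ip_pairs.csv as a made-up lookup name. First build the baseline:

      index=botsv3 sourcetype="stream:tcp"
      | stats count by src_ip dest_ip
      | outputlookup known_ip_pairs.csv

    Then the scheduled detection search over whatever recent window you choose:

      index=botsv3 sourcetype="stream:tcp"
      | stats count by src_ip dest_ip
      | lookup known_ip_pairs.csv src_ip dest_ip OUTPUT count as matched
      | where isnull(matched)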
  • 15:30 - 15:32
    And so, there are pros and cons. This is a
  • 15:32 - 15:34
    dynamic growing list, like the ones I did
  • 15:34 - 15:37
    over here with where earliest greater
  • 15:37 - 15:39
    than. But over here, we’re building a
  • 15:39 - 15:43
    lookup list, and we’re doing a match. Same
  • 15:43 - 15:44
    principle over here. We can take the
  • 15:44 - 15:47
    perfmon process, exact same thing. We’re
  • 15:47 - 15:50
    going to output it to a CSV, and then we
  • 15:50 - 15:53
    can set up a search to run every day where we
  • 15:53 - 15:56
    do this lookup on instance and host, and
  • 15:56 - 15:59
    where it’s not matched. Or we can go to
  • 15:59 - 16:03
    listening ports. We can output the lookup, and
  • 16:03 - 16:04
    we can do this. One of the things you
  • 16:04 - 16:09
    could do is you could actually take an
  • 16:09 - 16:13
    evaluation of the two. You could actually
  • 16:13 - 16:15
    take all the alerts that are
  • 16:15 - 16:18
    popping up each day and compare them to this
  • 16:18 - 16:21
    lookup list and see how much variance
  • 16:21 - 16:24
    there is. So, you could grab a 90-day
  • 16:24 - 16:26
    table and then compare it to this output
  • 16:26 - 16:28
    lookup. There are a lot of ways to evaluate
  • 16:28 - 16:31
    how much is changing in your environment,
  • 16:31 - 16:34
    but the big key is to use your
  • 16:34 - 16:38
    historical data to create a baseline and search on it.
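    One rough way you might measure that variance, reusing the hypothetical known_ip_pairs.csv baseline from above:

      index=botsv3 sourcetype="stream:tcp" earliest=-90d
      | stats count by src_ip dest_ip
      | lookup known_ip_pairs.csv src_ip dest_ip OUTPUT count as matched
      | eval status = if(isnull(matched), "not_in_baseline", "in_baseline")
      | stats count by status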
  • 16:40 - 16:44
    Alright, basic summary there. In that
  • 16:44 - 16:46
    video, we showed how we can use the stats
  • 16:46 - 16:48
    command to baseline normal behavior from
  • 16:48 - 16:50
    historical data. We used that baseline to
  • 16:50 - 16:51
    determine new events. We’re able to
  • 16:51 - 16:53
    detect anomalous network connections,
  • 16:53 - 16:55
    anomalous processes, and anomalous open
  • 16:55 - 16:57
    ports. We then did those very same things
  • 16:57 - 17:00
    with a CSV baseline of normal
  • 17:00 - 17:02
    behavior, and we were able to use that
  • 17:02 - 17:03
    CSV to detect the same thing--network
  • 17:03 - 17:06
    connections, processes, and
  • 17:06 - 17:09
    open ports. So, there are some gotchas that you
  • 17:09 - 17:11
    need to be aware of. This is a really cool
  • 17:11 - 17:13
    process, but as you start to get into it,
  • 17:13 - 17:14
    don’t let the gotchas get you. Don’t let
  • 17:14 - 17:17
    the quest for perfection get in the way
  • 17:17 - 17:19
    of getting something done or having a
  • 17:19 - 17:22
    good product. The rolling window and
  • 17:22 - 17:25
    allow list will get you a good answer.
  • 17:25 - 17:27
    It’s not perfect, and there will be some
  • 17:27 - 17:29
    gotchas along the road, but it will get
  • 17:29 - 17:31
    you most of the way. But now that you’ve
  • 17:31 - 17:33
    got those baselines, we’re going to
  • 17:33 - 17:34
    tell you some things you want to be
  • 17:34 - 17:36
    careful of. Rolling window: You’re going
  • 17:36 - 17:38
    to be alerted the first time that the
  • 17:38 - 17:40
    anomalous connection occurred. And then,
  • 17:40 - 17:43
    if you remember that X and Y, with X being
  • 17:43 - 17:45
    the baseline and Y being the new events, the
  • 17:45 - 17:48
    new events from Y are going to roll into
  • 17:48 - 17:51
    X. And now that anomalous event will be
  • 17:51 - 17:52
    part of your baseline. So, you’ll detect it
  • 17:52 - 17:55
    once, and then your anomaly is part of
  • 17:55 - 17:57
    your baseline. So, you do need to be aware of
  • 17:57 - 17:59
    that. And the frequency of the times
  • 17:59 - 18:02
    you run the alert is important.
  • 18:02 - 18:03
    You need to think about how often you
  • 18:03 - 18:05
    run this alert. Remember that you can
  • 18:05 - 18:07
    have a small window. Say, “I’m going to
  • 18:07 - 18:09
    look at a 90-day window, and I just want
  • 18:09 - 18:12
    to look at one second.” So, Y will be
  • 18:12 - 18:15
    that one second, and the baseline
  • 18:15 - 18:20
    will be 89 days, 23 hours, 59
  • 18:20 - 18:23
    minutes, and 59 seconds, or whatever. The fact
  • 18:23 - 18:24
    is, it’s still going to look at 90 days'
  • 18:24 - 18:27
    worth of data. And so, no matter how small
  • 18:27 - 18:29
    your Y window is, it’s always going to take the
  • 18:29 - 18:32
    time required to run the entire X
  • 18:32 - 18:35
    and Y window together. So, you need to be
  • 18:35 - 18:37
    aware that it can take some time to run
  • 18:37 - 18:39
    this alert. It sounds great to run a
  • 18:39 - 18:41
    really long query, like “I want an all-time or
  • 18:41 - 18:45
    a year-long query.”
  • 18:45 - 18:46
    Recognize that if you run that every day,
  • 18:46 - 18:48
    you’re still running that query every
  • 18:48 - 18:52
    day. So, it’s going to take some time, and you want
  • 18:52 - 18:53
    to make sure it doesn’t impact the rest
  • 18:53 - 18:55
    of the stuff you’re doing. Allow
  • 18:55 - 18:57
    listing, on the other hand, is
  • 18:57 - 18:59
    going to run against whatever window. So,
  • 18:59 - 19:00
    if you look at the last 10 seconds, it’s
  • 19:00 - 19:02
    only going to run on a 10-second window. If
  • 19:02 - 19:04
    you’re looking at the last hour, it’s going to
  • 19:04 - 19:07
    run on a 1-hour window. So, it will run
  • 19:07 - 19:10
    faster. But, you need to remember that
  • 19:10 - 19:12
    you need to figure out: How am I going to build that
  • 19:12 - 19:14
    baseline? How do we get new items
  • 19:14 - 19:17
    into the baseline? You’ll need to
  • 19:17 - 19:19
    address that. And remember that a
  • 19:19 - 19:21
    baseline, whether it’s a CSV or a KV store, is
  • 19:21 - 19:24
    going to occupy space on your search
  • 19:24 - 19:26
    head, and you can run out of disk space.
  • 19:26 - 19:28
    You only have so much. Typically, we
  • 19:28 - 19:30
    build a lot of space onto our indexers. Our
  • 19:30 - 19:33
    search heads are not huge on disk space.
  • 19:33 - 19:34
    Just be aware that as you start to build
  • 19:34 - 19:36
    large baselines, one, you'll have
  • 19:36 - 19:39
    performance issues. The more values it
  • 19:39 - 19:40
    has to look against, the slower your
  • 19:40 - 19:42
    search will run; and two, it's going to take
  • 19:42 - 19:45
    up physical disk space on your machine.
  • 19:45 - 19:48
    So, that's something you need to just be aware of.
  • 19:48 - 19:50
    I'm going to recommend a hybrid
  • 19:50 - 19:52
    approach, and that's the ability to
  • 19:52 - 19:54
    combine both. We're going to do a rolling
  • 19:54 - 19:58
    window and allow listing. And so, the basic
  • 19:58 - 20:01
    concept is we're going to use--your
  • 20:01 - 20:03
    query goes here. So, you're going to write
  • 20:03 - 20:04
    your query, and then you're going to use
  • 20:04 - 20:07
    this collect command. I'm not going to
  • 20:07 - 20:09
    comment on the different syntax in
  • 20:09 - 20:11
    Splunk, but just know that if you use
  • 20:11 - 20:13
    this pipe collect command, you will write
  • 20:13 - 20:15
    to a summary index. Summary indexes are a
  • 20:15 - 20:18
    form of indexing that do not cost you on
  • 20:18 - 20:20
    ingestion license. You can
  • 20:20 - 20:22
    write to a summary index, and then you
  • 20:22 - 20:24
    can query that index just like you could
  • 20:24 - 20:26
    query any other index. And so, you can
  • 20:26 - 20:28
    save your results in an index. The
  • 20:28 - 20:30
    concept is, if I want to
  • 20:30 - 20:32
    build these, I might run a query every
  • 20:32 - 20:33
    day, and I'm going to write it to the
  • 20:33 - 20:35
    index, and it will timestamp it with
  • 20:35 - 20:37
    today's information. Tomorrow will have
  • 20:37 - 20:38
    tomorrow's information, and yesterday
  • 20:38 - 20:40
    will have yesterday's information. And
  • 20:40 - 20:41
    we can query it and search it. And so,
  • 20:41 - 20:43
    you'll basically write your query. You'll
  • 20:43 - 20:45
    run the collect command index=summary
  • 20:45 - 20:47
    source and give it a name. Then
  • 20:47 - 20:49
    what you want to do is, now that you've
  • 20:49 - 20:51
    done that--that's going to
  • 20:51 - 20:54
    be your alert. Then you're
  • 20:54 - 20:55
    going to come in here, and you're going
  • 20:55 - 20:58
    to look at that summary data, and you're
  • 20:58 - 21:02
    going to append to that those
  • 21:02 - 21:04
    results that fired for that day. So, you
  • 21:04 - 21:06
    look at the last set of time it ran. So
  • 21:06 - 21:08
    maybe you run this once a day. You look
  • 21:08 - 21:10
    at yesterday's results. You put that in
  • 21:10 - 21:11
    here, and then you'll use this append
  • 21:11 - 21:14
    command, which will append the lookup. I
  • 21:14 - 21:16
    said allow list. It should be a disallow
  • 21:16 - 21:18
    list. That was bad wording here. Gotta
  • 21:18 - 21:21
    love it--gotta be careful with the descriptions
  • 21:21 - 21:23
    you use. This is a--you're going to grab a
  • 21:23 - 21:26
    list of things you don’t want to allow--things you still consider
  • 21:26 - 21:28
    anomalous. So, if you see these, you
  • 21:28 - 21:31
    want to flag them. It's not what I've
  • 21:31 - 21:33
    done before with the CSV, where that was my
  • 21:33 - 21:35
    normal baseline. These are bad events. I
  • 21:35 - 21:37
    don’t want these events. And so, I'm going
  • 21:37 - 21:39
    to do the input lookup allowlist.csv,
  • 21:39 - 21:41
    and I'm going to do a table on the
  • 21:41 - 21:44
    matching fields. So, this is matching
  • 21:44 - 21:45
    from this query over here, and then I'm
  • 21:45 - 21:47
    going to stats count by the matching
  • 21:47 - 21:49
    fields. That will basically dedupe.
  • 21:49 - 21:51
    The dedupe command will remove the
  • 21:51 - 21:54
    duplicates, but stats does the same thing,
  • 21:54 - 21:55
    and it's more efficient. So, if you want
  • 21:55 - 21:57
    to write dedupe, you can, but I recommend
  • 21:57 - 21:59
    that you learn the power of the stats
  • 21:59 - 22:02
    command. It is fast. It's
  • 22:02 - 22:04
    just the right command to
  • 22:04 - 22:05
    use. So, stats count by matching is
  • 22:05 - 22:07
    basically removing the duplicates. So, if
  • 22:07 - 22:10
    it was in this summary index and
  • 22:10 - 22:11
    it's in my lookup, we're not going to
  • 22:11 - 22:12
    write it in twice. And then we'll write
  • 22:12 - 22:15
    it to this allow list CSV, which will
  • 22:15 - 22:17
    update it, which means all the new things
  • 22:17 - 22:18
    that were found will then be written
  • 22:18 - 22:21
    into this lookup, and then it will be
  • 22:21 - 22:23
    updated. It'll have a new lookup with the
  • 22:23 - 22:26
    results combined. So, an example that
  • 22:26 - 22:28
    would be: index=botsv3,
  • 22:28 - 22:32
    sourcetype=PerfmonMk:Process, stats min(_time)
  • 22:32 - 22:34
    as earliest_time, max(_time) as latest_time. This is
  • 22:34 - 22:36
    all the exact same query. Nothing’s
  • 22:36 - 22:39
    really changed. The difference is, after I
  • 22:39 - 22:40
    do this eval time, I'm now going to do
  • 22:40 - 22:42
    this lookup. For the lookup, I'm going to use the
  • 22:42 - 22:44
    name of the CSV. I'm going to do
  • 22:44 - 22:47
    instance as instance, host as host, output
  • 22:47 - 22:49
    instance as recurring. I need a value
  • 22:49 - 22:51
    that shows, hey. I matched the CSV. And
  • 22:51 - 22:53
    then I'm going to go where--and I
  • 22:53 - 22:55
    actually changed this for max time.
  • 22:55 - 22:57
    I want a min time and a max time. And
  • 22:57 - 22:59
    the reason is the min time is
  • 22:59 - 23:02
    looking to see if the value falls in the
  • 23:02 - 23:04
    X part of my X-Y rolling window. The max
  • 23:04 - 23:07
    time is used to find if it's in the Y
  • 23:07 - 23:09
    area, and we'll explain that. So, I still
  • 23:09 - 23:10
    have the same where
  • 23:10 - 23:11
    earliest_time > now_time.
  • 23:11 - 23:13
    That’s going to say, “Hey, I’ve never seen this event
  • 23:13 - 23:16
    before.” Or recurring = *, meaning,
  • 23:16 - 23:20
    “Hey, I got a match on this value,” and the
  • 23:20 - 23:22
    latest_time > now_time,
  • 23:22 - 23:25
    meaning it’s in the Y section. That means
  • 23:25 - 23:27
    this alert that is on my list of things
  • 23:27 - 23:30
    I don’t want to keep seeing, I don’t want to allow--
  • 23:30 - 23:33
    it just showed up again. That way,
  • 23:33 - 23:35
    you’ll be notified again that it
  • 23:35 - 23:37
    occurred. You're updating your queue, your
  • 23:37 - 23:40
    lookup file, and you're using a
  • 23:40 - 23:42
    rolling window. And so, you kind of get
  • 23:42 - 23:44
    the best of both worlds, and you can use
  • 23:44 - 23:48
    this as a method to automate
  • 23:48 - 23:51
    keeping up to date on any of your
  • 23:51 - 23:54
    alerts. And now we're going to demo that.
  • 23:54 - 23:56
    I've got a video of it. We're going to go
  • 23:56 - 24:00
    watch that, and then we'll come back.
  • 24:00 - 24:02
    Alright. So, this is the hybrid approach
  • 24:02 - 24:03
    that we've been talking about. It’s going
  • 24:03 - 24:05
    to look exactly like we did before.
  • 24:05 - 24:07
    You've got yourself the normal
  • 24:07 - 24:10
    query. We're going to just make some form
  • 24:10 - 24:12
    of query here. This is going to build
  • 24:12 - 24:13
    our processes, and then we're going to
  • 24:13 - 24:16
    write a collect command. This collect
  • 24:16 - 24:18
    command will write our results
  • 24:18 - 24:22
    into a summary index, and that’s the
  • 24:22 - 24:24
    index=summary. And then we go
  • 24:24 - 24:27
    source=new_process. That’s going
  • 24:27 - 24:30
    to define the name of the source for
  • 24:30 - 24:32
    this summary index. And so, we're going to
  • 24:32 - 24:34
    do that, run the results, and we
  • 24:34 - 24:36
    can see that we got two values coming
  • 24:36 - 24:38
    back. If we jump over here, this is where
  • 24:38 - 24:40
    we're going to take that very same--we
  • 24:40 - 24:42
    can see the summary index being run.
  • 24:42 - 24:44
    There are my two results. We can query them
  • 24:44 - 24:47
    just like any other index, and you'll
  • 24:47 - 24:49
    notice index=summary, source=new_process,
  • 24:49 - 24:49
    and we use this
  • 24:49 - 24:52
    append command. This append is going to
  • 24:52 - 24:55
    add this lookup, this new_process.csv,
  • 24:55 - 24:57
    and then we're going to put in the table
  • 24:57 - 24:59
    of the instance and the host. I'm just
  • 24:59 - 25:00
    going to do a stats count by instance,
  • 25:00 - 25:03
    host. That's basically going to deduplicate so I
  • 25:03 - 25:04
    don’t--it’s going to take the index
  • 25:04 - 25:06
    summary and the input lookup and make
  • 25:06 - 25:08
    them one if there are any duplicates there,
  • 25:08 - 25:10
    so I don't get them twice. That's what
  • 25:10 - 25:12
    that command’s going to do. It's faster
  • 25:12 - 25:14
    than dedupe, but I’m going to output the lookup
  • 25:14 - 25:17
    to new_process.csv. And that's going to
  • 25:17 - 25:20
    write what was the original new_process.csv,
  • 25:20 - 25:21
    and it’s going to update it
  • 25:21 - 25:24
    with any new fields coming from this summary index.
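    As a sketch of those two pieces, assuming the same PerfmonMk:Process data and a lookup named new_process.csv: first, the alert search that writes its hits into the summary index:

      index=botsv3 sourcetype="PerfmonMk:Process"
      | stats min(_time) as earliest_time by instance host
      | eval now_time = now() - 86400
      | where earliest_time > now_time
      | collect index=summary source="new_process"

    Then the maintenance search that folds those hits into the lookup without duplicating what is already there:

      index=summary source="new_process" earliest=-1d
      | table instance host
      | append [| inputlookup new_process.csv | table instance host]
      | stats count by instance host
      | outputlookup new_process.csv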
  • 25:24 - 25:27
    So, we can see that being run if I
  • 25:27 - 25:29
    go over, we're going to get that
  • 25:29 - 25:32
    taken care of. We go run
  • 25:32 - 25:34
    this. What it's going to do is it's going to
  • 25:34 - 25:35
    grab the summary index stuff, and it's
  • 25:35 - 25:36
    going to grab the stuff that was already
  • 25:36 - 25:38
    in my CSV, and it's going to write them
  • 25:38 - 25:42
    in there. And so now I have four values,
  • 25:42 - 25:44
    and those all got written into the
  • 25:44 - 25:46
    CSV. Now I’m going to write my query that
  • 25:46 - 25:48
    I've been doing, the rolling window, all
  • 25:48 - 25:50
    over again. The
  • 25:50 - 25:53
    difference is I'm going to add a max
  • 25:53 - 25:54
    time in there. That's not just going to be
  • 25:54 - 25:56
    a min time. It's also going to be a max
  • 25:56 - 25:57
    time, so I can look at the Y side of the
  • 25:57 - 26:00
    equation. And then I'm going to do this
  • 26:00 - 26:02
    lookup: new_process.csv, instance as instance,
  • 26:02 - 26:05
    host as host, output instance as
  • 26:05 - 26:07
    recurring. I need the output
  • 26:07 - 26:09
    instance to show me what matched on
  • 26:09 - 26:10
    this lookup. It’s kind of like a join
  • 26:10 - 26:12
    command. It’s going to join the CSV to
  • 26:12 - 26:14
    the previous values, and whatever
  • 26:14 - 26:17
    matches is going to be output.
  • 26:17 - 26:18
    And I’m going to do the same like I’ve
  • 26:18 - 26:20
    always been doing: earliest_time greater
  • 26:20 - 26:22
    than now_time, and that’s the normal check. And then
  • 26:22 - 26:24
    we’re going to add this recurring = *.
  • 26:24 - 26:25
    Recurring = * means it
  • 26:25 - 26:27
    matched on something. I have a value, and
  • 26:27 - 26:30
    latest_time > now_time. And
  • 26:30 - 26:32
    that’s going to look--is there a value in
  • 26:32 - 26:35
    the Y window, and is it a recurring value?
  • 26:35 - 26:37
    And if it is, that's going to alert. And
  • 26:37 - 26:40
    so we can see that being run,
  • 26:47 - 26:49
    and we're just going to see the--
  • 26:49 - 26:50
    we're going to go back to those two
  • 26:50 - 26:52
    results. These two new
  • 26:52 - 26:55
    events occurred during the new window,
  • 26:55 - 26:56
    and they'll keep showing up as often as
  • 26:56 - 26:58
    they occur.
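    Putting it together, the hybrid detection search from this demo looks roughly like this; I'm using isnotnull(recurring) as one way to express the "recurring = *" match described above:

      index=botsv3 sourcetype="PerfmonMk:Process"
      | stats min(_time) as earliest_time max(_time) as latest_time by instance host
      | lookup new_process.csv instance AS instance host AS host OUTPUT instance AS recurring
      | eval now_time = now() - 86400
      | where earliest_time > now_time OR (isnotnull(recurring) AND latest_time > now_time)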
  • 27:00 - 27:02
    Alright. We basically showed how we can
  • 27:02 - 27:04
    combine those two approaches in a hybrid
  • 27:04 - 27:07
    approach. We created our lookup of
  • 27:07 - 27:08
    anomalous behaviors so they don't get
  • 27:08 - 27:11
    excluded from future alerts. And
  • 27:11 - 27:12
    then there are other things we can
  • 27:12 - 27:15
    do. We can look at the results of
  • 27:15 - 27:18
    one against the other, and we can see if
  • 27:18 - 27:19
    there are changes in our environment, how
  • 27:19 - 27:21
    much change is going on. This gives you
  • 27:21 - 27:23
    the ability to gain
  • 27:23 - 27:25
    a lot more insight.
  • 27:25 - 27:28
    Alright? So what's next? I've shown
  • 27:28 - 27:30
    you how you can build baselines. I've
  • 27:30 - 27:32
    shown you examples of them. I've given
  • 27:32 - 27:34
    you multiple methods. What I want you to
  • 27:34 - 27:36
    do is now look at your environment. Think
  • 27:36 - 27:39
    right now. What do I have in my
  • 27:39 - 27:41
    environment that I could baseline? What
  • 27:41 - 27:43
    could I grab? What logs could I use? I
  • 27:43 - 27:45
    could take that very same approach, and I
  • 27:45 - 27:46
    want you to think about it right now.
  • 27:46 - 27:50
    Write it down, and let's go do it. This
  • 27:50 - 27:52
    video is great, but if you don't take
  • 27:52 - 27:55
    action on it, this video will not have
  • 27:55 - 27:56
    served its full purpose. So take that
  • 27:56 - 27:58
    time right now to think: What data,
  • 27:58 - 28:03
    what logs do I have that I can
  • 28:03 - 28:05
    use, and what approach can I take to make a
  • 28:05 - 28:09
    baseline and check for anomalous events?
  • 28:09 - 28:11
    Thank you so much for your time,
  • 28:11 - 28:14
    and I now open it up to questions.