Needle in a haystack, or grep revisited: tre-agrep

Probably everyone who uses a terminal knows the command grep, cf. this excerpt from its man page:

grep searches the named input FILEs (or standard input if no files are named, or if a single hyphen-minus (-) is given as file name) for lines containing a match to the given PATTERN. By default, grep prints the matching lines.

So grep is the go-to tool for searching a big file for a specific pattern, or for picking a specific process out of the complete list of running processes. It has a limitation, though: it only finds the exact string you search for, while sometimes an “approximate” or “fuzzy” search would be more useful.

It was for this purpose that the program agrep was first developed; Wikipedia gives some details about this software:

agrep (approximate grep) is a proprietary approximate string matching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system. It was later ported to OS/2, DOS, and Windows.

It selects the best-suited algorithm for the current query from a variety of the known fastest (built-in) string searching algorithms, including Manber and Wu’s bitap algorithm based on Levenshtein distances.

agrep is also the search engine in the indexer program GLIMPSE. agrep is free for private and non-commercial use only, and belongs to the University of Arizona.

So it’s closed source, but luckily there is an open source alternative: tre-agrep.

Tre Library

TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.

The matching algorithm used in TRE has linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the regular expression. In other words, the time complexity of the algorithm is O(M^2 N), where M is the length of the regular expression and N is the length of the text. The space used is also quadratic in the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only in pathological cases which are probably very rare in practice.

Approximate matching

Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance), where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match. Each insertion, deletion, or substitution adds to the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value, and it can also be used to search for the matches with the lowest cost.

INSTALLATION

tre-agrep is usually not installed by default by any distribution, but it is available in many repositories, so you can easily install it with your distribution’s package manager; e.g. on Debian, Ubuntu, and Mint you can use the command:

apt-get install tre-agrep

BASIC USAGE

The usage is best demonstrated with some simple examples of this powerful command. Given the file example.txt that contains:

Résumé
RÉSUMÉ
resume
Resümee
rèsümê
Resume
linuxaria

Following is the output of the command tre-agrep with different options:

 mint-desktop tmp # tre-agrep resume example.txt
resume

mint-desktop tmp # tre-agrep -i resume example.txt
resume
Resume

mint-desktop tmp # tre-agrep -1 -i resume example.txt
resume
Resümee
Resume

mint-desktop tmp # tre-agrep -2 -i resume example.txt
Résumé
RÉSUMÉ
resume
Resümee
Resume

As you can see, without any options it returns the same result as a normal grep, and the -i option makes the search case insensitive. The interesting options are -1 and -2: these set the distance allowed in the search, so the larger the number, the more results you’ll get, since you allow a greater “distance” from the original pattern.

To see the distance of each match you can use the option -s: it prints each match’s cost:

mint-desktop tmp # tre-agrep -5 -s -i resume example.txt
2:Résumé
2:RÉSUMÉ
0:resume
1:Resümee
3:rèsümê
0:Resume
5:linuxaria

So in this example the string Resume has a cost of 0, while linuxaria has a cost of 5.

Further interesting options are those that assign a cost for different operations:

-D NUM, --delete-cost=NUM – Set cost of missing characters to NUM.
-I NUM, --insert-cost=NUM – Set cost of extra characters to NUM.
-S NUM, --substitute-cost=NUM – Set cost of incorrect characters to NUM. Note that a deletion (a missing character) and an insertion (an extra character) together constitute a substituted character, but the cost will be that of a deletion and an insertion added together.
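
A quick, hedged sketch of how these switches could be combined with the ones shown above (the exact costs reported will of course depend on your pattern and input):

# make substitutions twice as expensive, allow a total cost of up to 2,
# ignore case and print each match's cost:
tre-agrep -2 -s -i --substitute-cost=2 resume example.txt

# the same call using the short form of the cost switch:
tre-agrep -2 -s -i -S 2 resume example.txt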

CONCLUSIONS

The command tre-agrep is yet another small tool that can save your day if you work a lot with terminals and bash scripts.

 

This article was originally published on linuxaria. Castlegem has permission to republish. Thank you, linuxaria!

Hard Disks: Bad Block HowTo

Hardware fails, that is a fact. Nowadays, hard drives are rather reliable, but nevertheless every now and then we will see drives failing or at least having hiccups. Using smartctl/smartd to monitor disks is a good thing; below we will discuss how some lesser issues can be handled without actually having to reboot the system – it is still up to a sysadmin’s own discretion to judge the circumstances correctly and evaluate whether the disk errors encountered are a one-time incident or indicative of an entirely failing disk.

Let’s have a look at a typical smartctl -a DEVICE output:

# smartctl -a /dev/sda

...
ID# ATTRIBUTE_NAME          .... RAW_VALUE
197 Current_Pending_Sector  .... 2
...

OK, so we have an oops here. Time to find out what is going on:

# smartctl --test=short /dev/sda

This will take a very short time, a couple of minutes at most, e.g.:

Please wait 2 minutes for test to complete.
Test will complete after Sat Feb  2 16:25:10 2013

Now, with a current pending sector count > 0 we will most likely have an ouch after the test completes:

Num  ..  Status                  Remaining  ..  LBA_of_first_error
...
# 2  ..  Completed: read failure 90%        ..  1825221261
...

LBA counts sectors in units of 512 bytes and starts at 0, so we now need to find out where 1825221261 is actually located:

# fdisk -lu /dev/sda

will display some information about the device in question:

   Device Boot      Start         End      Blocks   Id  System
...
/dev/sda3        31641600  1953523711   960941056   83  Linux
...

Obviously, 1825221261 is therefore on /dev/sda3. Now we need to determine the file system block for the LBA in question, so we first have to get the block size:

# tune2fs -l /dev/sda3 | grep Block

Block count:              240235264
Block size:               4096
Blocks per group:         32768

OK, 4096 bytes. So, the actual block number will be:

(LBA - PARTITION_START_SECTOR) * (512 / BLOCKSIZE)

In our case, this is:

(1825221261 - 31641600) * (512 / 4096) = 224197457.625

We only need the integer part, the fraction just tells us that we are into the 6th sector out of eight that make up this file system block.
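
If you prefer to let the shell do the arithmetic, a small sketch like this (using the values from above) gives the same block number; integer division conveniently drops the fraction:

LBA=1825221261; START=31641600; BS=4096
echo $(( (LBA - START) * 512 / BS ))    # prints 224197457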

It is good practice to find out which inode/file has been affected by using debugfs (operations can take a while with this tool):

# debugfs

debugfs:  open /dev/sda3
debugfs:  icheck BLOCK (224197457 in our case)
Block   Inode number
224197457       56025154
debugfs:  ncheck 56025154
Inode   Pathname
56025154        /some/path/to/file

Now, if this file isn’t anything crucial, we can start correcting things:

# dd if=/dev/zero of=/dev/sda3 bs=4096 count=1 seek=BLOCK
  (224197457 here)
# sync

smartctl -a will now show an updated current pending sector count, and you can re-run a short smartctl test.
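
To double-check afterwards, something along these lines should do (the device name is the one from above):

# the raw value of attribute 197 should drop back to 0
smartctl -a /dev/sda | grep -i Current_Pending_Sector
# then re-run the short self-test and review its log
smartctl --test=short /dev/sda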

Source: http://www.vanderzee.org/bad_blocks_howto

 

Migrating Proxmox KVM to Solus / CentOS KVM

By default, Proxmox creates KVM based VMs on a single disk partition, typically in raw or qcow2 format. Solus, however, uses an LVM based system. So how do you move things over from Proxmox to Solus? Here goes:

  1. Shut down the respective Proxmox VM;
  2. As an additional precaution, make a copy of the Proxmox VM (cp will do);
  3. If the Proxmox VM is not in raw format, you need to convert it using qemu-img:
    qemu-img convert PROXMOX_VM_FILE -O raw OUTPUT_FILE
    Proxmox usually stores the image files under /var/lib/vz/images/ID
  4. Create an empty KVM VM on the Solus node with a disk size at least as large as the raw file of the Proxmox VM (and possibly adjust settings such as driver, PAE, etc.), and keep it shut down;
  5. In the config file (usually under /home/kvm/kvmID) of the newly created Solus VM, check the following line:
    <source file='/dev/VG_NAME/kvmID_img'/>
    and make a note;
  6. dd the Proxmox raw image over to the Solus node:
    dd if=PROXMOX_VM.raw | ssh [options] user@solus_node 'dd of=/dev/VG_NAME/kvmID_img'
  7.  Boot the new Solus KVM VM;
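
As a condensed, hedged example of steps 3 and 6, with purely hypothetical IDs and paths (Proxmox VM 101, Solus VM 105, volume group VG_NAME):

# step 3: convert the Proxmox qcow2 image to raw
qemu-img convert -O raw /var/lib/vz/images/101/vm-101-disk-1.qcow2 /root/vm-101.raw

# step 6: stream the raw image onto the Solus logical volume
dd if=/root/vm-101.raw bs=1M | ssh root@solus-node 'dd of=/dev/VG_NAME/kvm105_img bs=1M'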

 

IOPS and RAID considerations

IOPS (input/output operations per second) are still – maybe even more so than ever – the most prominent and important metric to measure storage performance. With SSD technology finding its way into affordable, mainstream server solutions, providers are eager to outdo each other offering ever higher IOPS dedicated servers and virtual private servers.

While SSD based servers will perform vastly better than SATA or SAS based ones, especially for random I/O, the type of storage alone isn’t everything. Vendors will often quote performance figures using lab conditions only, i.e. the best possible environment for their own technology. In reality, however, we are facing different conditions – several clients competing for I/O, as well as a wide ranging mix of random reads and writes along with sequential I/O (imagine 20 VPS doing dd bs=1M count=128 if=/dev/zero of=test conv=fdatasync).

Since most providers won’t offer their servers without RAID storage, let’s have a look at how RAID setups impact IOPS then. Read operations will usually not incur any penalty since they can use any disk in the array (total theoretical read IOPS available therefore being the sum of the individual disks’ read IOPS), whereas the same is not true for write operations as we can see from the following table:

RAID level    Backend write IOPS per incoming write request
 0            1
 1            2
 5            4
 6            6
10            2

We can see that RAID 0 offers the best write IOPS performance – a single incoming write request equates to a single backend write request – but we also know that RAID 0 bears the risk of total array loss if a single disk fails. RAID 1 and RAID 10, the latter being providers’ typical or most advertised choice, offer a decent tradeoff – 2 backend writes per incoming write. RAID 5 and RAID 6, which have to update parity on every write, bear the largest penalty.

When calculating the effective IOPS, thus, keep in mind the write penalty individual RAID setups come with.

The effective IOPS performance of your array can be estimated using the following formula:

IOPS_eff = n * IOPS_disk / ( R% + W% * F_RAID )

with n being the number of disks in the array, R% and W% being the read and write percentages, and F_RAID being the RAID write factor tabled above.

We can also calculate the total IOPS performance needed based on an effective IOPS workload and a given RAID setup:

IOPS_total = ( IOPS_eff * R% ) + ( IOPS_eff * W% * F_RAID )

So if we need 500 effective IOPS, and expect around 25% read, and 75% write operations in a RAID 10 setup, we’d need:

500 * 0.25 + 500 * 0.75 * 2 =  875 total IOPS

i.e. our array would have to support at least 875 total, theoretical IOPS. How many disks/drives does this equate to? Today’s solid state drives will easily be able to handle that, but what about SATA or SAS based RAID arrays? A typical SAS 10k hard disk drive will give you around 100-140 IOPS. That means we will need 8 SAS 10k drives to achieve our desired IOPS performance.
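
Both formulas are easy to play with on the command line; here is a small awk sketch using the numbers from this example (8 disks and roughly 120 IOPS per SAS 10k drive are assumptions):

# effective IOPS of an 8-disk RAID 10 array at 25% reads / 75% writes
awk 'BEGIN { n=8; disk=120; r=0.25; w=0.75; f=2;
             printf "IOPS_eff   = %.0f\n", n*disk/(r + w*f) }'

# raw IOPS the array must deliver for 500 effective IOPS
awk 'BEGIN { eff=500; r=0.25; w=0.75; f=2;
             printf "IOPS_total = %.0f\n", eff*r + eff*w*f }'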

Conclusion:
All RAID levels except RAID 0 have significant impact on your storage array’s IOPS performance. The decision about which RAID level to use is therefore not only a question about redundancy or data protection, but also about resulting performance for your application’s needs:

  1. Evaluate your application’s performance requirements;
  2. Evaluate your application’s redundancy needs;
  3. Decide which RAID setup to use;
  4. Calculate the resulting IOPS performance necessary;

 

Sources:

Calculate IOPS in a storage array by Scott Lowe, TechRepublic, 2/2010
Getting the hang of IOPS by Symantec, 6/2012

 

 

 

Adding disks to Windows VMs under KVM

Reading through various posts on forums and blogs all over the web, there are many solutions offered for how to add another disk to a Windows VM running under KVM. Below is one solution that worked smoothly for all our nodes running the Solus control panel, with KVM as the virtualisation technology:

  1. create a new volume with
    lvcreate -L [INTEGERSIZE]G -n [NEW_VOL_NAME] [VOLUMEGROUPNAME]
  2. edit the vm’s config file (under Solus, this is usually /home/kvm/kvmID/kvmID.xml), and add a section below the first disk (assuming hda has already been assigned, we use hdb here for the new disk):
        <disk type='file' device='disk'>
         <source file='/dev/VOLUMEGROUPNAME/NEW_VOL_NAME'/>
         <target dev='hdb' bus='ide'/>
        </disk>
  3. shut down and then boot the vm
  4. log in, and in the storage section of your server administration tool, initialise and format the new disk
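
A concrete, hedged example of steps 1 and 2, assuming a volume group called vg0 and VM ID 101 (both hypothetical):

# step 1: a 20 GB volume for the new disk
lvcreate -L 20G -n kvm101_disk2 vg0

# step 2: the matching source line in /home/kvm/kvm101/kvm101.xml would then read
#   <source file='/dev/vg0/kvm101_disk2'/>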

NB for Solus: you will have to create a hook and enable advanced config in the control panel, otherwise Solus will overwrite the edited config again. The most basic hook would just hold the production config in a separate file in the same directory, and the hook would ensure that the new file is being used, e.g. from ./hooks/hook_config.sh (must be executable):

#!/bin/sh
mv /home/kvm/kvmID/kvmID.xml /home/kvm/kvmID/kvmID.xml.dist
cp -f /home/kvm/kvmID/kvmID.xml.newdisk /home/kvm/kvmID/kvmID.xml

 

 

Xen HVM / Solus – Network card / driver issues

Every now and then we run into problems with fully virtualised VMs not recognising their assigned network card. Most often, this happens under Xen HVM with the latest Debian/Ubuntu, and even CentOS, full or netinstall ISOs.

Under Solus CP there is a very simple fix for this, even though the custom config / change of network card does not seem to work properly. Pretty much every Linux distribution should recognise the emulated Intel e1000 card that the hook below sets as the model:

On the node with the VM having issues, go to /home/xen/vmID and check the vif line in the vmID.cfg file. Take note of it, and then go to the hooks directory (or create it if it does not exist yet).

Create an executable file hook_config.sh, and edit it as follows:

#!/bin/sh
grep -Ev 'vif' /home/xen/vmID/vmID.cfg > /home/xen/vmID/vmID.cfg.tmp
mv /home/xen/vmID/vmID.cfg.tmp /home/xen/vmID/vmID.cfg
echo "vif        = ['ip=aaa.bbb.ccc.ddd, vifname=vifvmID.0, mac=..., rate=...KB/s, model=e1000']" >> /home/xen/vmID/vmID.cfg

Save it, and reboot your VM. This should let your VM find its network card and allow you to continue with the installation and subsequent production use.
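
With the hook in place, a quick sanity check could look like this (vmID is a placeholder, and when exactly Solus runs the hook depends on your setup):

chmod +x /home/xen/vmID/hooks/hook_config.sh
# after the next rebuild/boot the vif line should carry model=e1000
grep vif /home/xen/vmID/vmID.cfg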

 

Managed or not?

We had a similar post back in July 2011, so why are we bringing this up again? Recently, we have seen a large surge in two categories of orders: unmanaged low-end VPS (256MB memory and the like, for use as DNS servers, etc.), and fully managed servers.

Customers are increasingly aware of the need to back their sites with a well managed server. Typically, though, the managed option only extends to managing the operating system (and possibly the hardware) of the server in question, i.e. keeping the operating system updated with the latest security patches (something that an “intelligent” control panel such as cPanel can mostly handle itself), installing the latest package upgrades, and generally making sure the server works as intended.

In most cases, however, managed does not cover application issues. This is a crucial point: you as the customer need to be sure that the server administration side of your enterprise speaks the same language as the application development side. Nothing is worse than an eager sysadmin updating a software package without consulting the developers who, incidentally, depend on the older version for the entire site to run smoothly. With today’s globalisation this can cause you additional grief – often your developers are from a different company than your ISP, and, naturally, each side will defend itself against taking the blame. It will leave you and your enterprise crippled or hindered.

What do we advise?

  1. Don’t save money on a sysadmin.
  2. Make sure your sysadmin talks to your developers and understands what they need.
  3. Make sure your sysadmin has a basic understanding of your application in case of emergencies.
  4. Make sure your staff (sysadmins and developers) coordinate updates and upgrades.
  5. Make sure you have a working test environment where you can run the updates and upgrades in a sandbox to see whether things still work as expected afterwards.
  6. Have a teamleader coordinate your sysadmin(s) and developer(s), or take this role upon yourself.

How much is it going to cost you?

Fully managed packages vary in cost. Normal sysadmin packages that deal with the operating system only will add anything between £20 and £200 per month to your budget; if you want the sysadmin to be an integral part of your team and support your application as well (in terms of coordinated server management), the price will be towards the higher end of that range, but might then also include some support for the application itself.

Who to hire?

Get someone with experience. There are sysadmins out there who have decades of experience and know the do’s and don’ts, and there are sysadmins who consider themselves divine just because they have been “into linux for 2 years”. A sysadmin is not someone who jumps at the first sight of an available package upgrade and yum installs 200 dependencies to claim the system is up to date. A sysadmin is someone who understands the implications of a) upgrading and b) not upgrading. A sysadmin will weigh these pros and cons and explain them to you before making suggestions as to what to do. A sysadmin is someone you trust to take this decision off your shoulders so you can run your business instead of having to worry whether the next admin cowboy is going to blow up your server. A sysadmin is someone who knows not only how to keep a system alive, but also how to bring a failed system back to life.

These are just some general guidelines, contact us for further advice, we are happy to help!

Backups

When we mention backups, everybody will think, “hey, my data is safe anyway, isn’t it? I mean this is a reputable ISP, sure they have enterprise disks and RAID, and whatnot? Or don’t they?!”.

There are two important NBs when it comes to backing up your data, be it on a Virtual Private Server (VPS) or on dedicated servers:

  1. Don’t assume anyone but you is going to back up your data.
  2. Don’t assume that even if your ISP backs up your data that you shouldn’t as well.

By default, it is safe to assume that your provider does not back up your data. Typically, explicit backups will cost you some additional money, and even then you are well advised to ask your ISP what they are backing up, how, how often they do it, and where the data is being kept.

A couple of bad backup solutions:

  • different disk (or, worse even, partition) on the same machine;
  • some external drive, like a USB disk;

A couple of decent/workable backup solutions:

  • standby server in the same DC;
  • ftp space on some other machine in the same DC;
  • making sure disks are RAID (this is not, however, a real backup strategy, it just helps to gain some redundancy and should be treated more like a complementing measure; no ISP should, unless explicitly asked to, offer you a setup without RAID; a disk failure in a RAID setup at least allows online recovery in hot swap environments);

A good backup solution:

  • a generation-driven backup strategy on a server or backup system (such as IBM’s TSM, which can back up to SAN and tape, or Bacula, which is free of charge and can perform full/differential/incremental backups, for example) in a different DC;

If your ISP employs one of the bad solutions, you should explicitly look for a service that allows you to at least back up your data somewhere else as well, in a different data centre. You should also consider this when your ISP can only offer a backup solution that can be considered workable at best. If your ISP, however, can prove that they are using an enterprise solution to back up your data, then you can assume that your data is safe – nevertheless you should back up your data as well. At least make dumps and tgz’s of your most important data, download them, and store them safely away, burn them to CD/DVDs, etc. Be prepared for the worst case: backups can go corrupt, you might accidentally delete every copy of a file you desperately need, etc.
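
As a minimal sketch of the “dumps and tgz’s” idea (paths, hostnames, and database credentials are placeholders or assumed to be configured elsewhere):

# dump all MySQL databases and tar up the web root, then copy both off-site
mysqldump --all-databases --single-transaction | gzip > /root/db-$(date +%F).sql.gz
tar czf /root/www-$(date +%F).tgz /var/www
scp /root/db-$(date +%F).sql.gz /root/www-$(date +%F).tgz user@offsite-host:/backup/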

Backing up is only half the story – backed-up data is nice to have, but you also need to be able to restore it. Make sure to test your backup/restore strategy: back up data, restore it, and see if it works. Repeat this at regular intervals, and repeat it whenever you make major changes to your application or need to document milestones, etc. Ask yourself: how much is your data worth to you? What if you lose everything? When it comes to your data, it comes down to your online presence, your enterprise and company. Don’t assume. Make sure.

 

 

Checking connectivity

There are various tools to measure and check the connectivity of your dedicated server or virtual private server. Below we will give an overview of the most common ones, along with their most widespread use.

  1. ping
    ping is probably the best known tool for checking whether a server is up or not. A ping is a small packet of traffic sent from the originating machine to the destination machine; the sender expects a so-called echo reply to see whether the destination host is up, running, and responding. The typical Linux syntax is:
    ping [-c INTEGER] [-n] [-q] HOSTNAME | IP address
    with -c followed by the number of packets to send, -n for numeric output (IP addresses only – no DNS resolution), and -q for quiet output so that only the summary lines are displayed. The output shows how long it takes for each packet (or the packets on average) to travel back and forth between the source and destination host (the round trip time). Large deviations in the min/avg/max values may indicate network congestion, whereas significant packet loss may indicate general network outages, or congestion to a point where the network is simply too overloaded to let anything else through and just drops packets instead. 100% packet loss does not, however, necessarily mean that the destination host is dead – it may simply be that the destination server is blocking ICMP ping packets via its firewall.
  2. traceroute
    traceroute is another useful tool that displays the route packets take from the originating host to the destination machine. It also displays round trip times, and can be used to identify potential issues on the way to the machine as well. It is important to understand that firewalls and routers are able to filter and deny these probing packets as well, so a non responding host may not necessarily be down, just as with ping. The typical Linux syntax is
    traceroute [-n] HOST | IP address
  3. mtr
    mtr can be seen as the combination of ping and traceroute – it not only displays the route packets travel from the source to the destination, but also shows min/avg/max round trip statistics and packet loss. mtr is very helpful in determining network congestion or anomalies. The typical Linux syntax is
    mtr [-n] HOST | IP address
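
Putting the three tools together, a quick example session might look like this (the hostname is a placeholder and the output will vary):

ping -c 4 -n example.com       # four numeric pings plus a min/avg/max summary
traceroute -n example.com      # the hops packets take towards the host
mtr -n -r -c 10 example.com    # a 10-cycle mtr report combining both views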

When would you typically use these tools:

  • when a host that is normally up can suddenly no longer be reached;
  • when you notice anomalies like slow network, packet loss, etc.;
  • when you want to prove that things are working OK on your end;

 

Monitoring your server

It is very important that you monitor your server, and by that we mean not only whether it is up or not, but a much more detailed view of what is going on. Popular open source monitoring tools are Nagios, Cacti, Munin, and Zabbix, and it is not uncommon to use them in combination.

What, then, are the stats you should generally be monitoring?

  • uptime – pinging the server (provided ICMP replies are not being filtered) to check whether it is alive or not;
  • disk space – monitoring the free space on all partitions. A full root partition is particularly nasty as it can bring your entire server to a halt, but it is not difficult to see that any full partition is generally a bad thing that can cause disastrous side effects;
  • memory consumption – how much physical RAM is left, how much is being used by the system, by applications, etc. Is swap space in use, how often is it being used, etc.;
  • CPU utilisation – how loaded is the CPU, do you have enough reserves, or are you already using the CPU near its capacity limits, how many processes are being run at the same time, etc.;
  • service monitoring – are all the services on your server running as planned? Such as apache, mysqld, sshd, etc.;
  • database monitoring – what is your database doing, how many queries per second are being executed, how many simultaneous connections do you have, and so on;
  • network traffic – is your server generating a lot of unwanted traffic, do you have any unusual spikes, or how much traffic are you using, anyway?
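
As a minimal illustration of the kind of check such tools run (a sketch only – real deployments would use the monitoring system’s own plugins), a disk space alert can be as simple as:

#!/bin/sh
# warn if any mounted filesystem is more than 90% full (the threshold is arbitrary)
df -P | awk 'NR > 1 && $5+0 > 90 { print "WARNING: " $6 " is at " $5 }'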

These are just examples, but they give you an idea of what can be done – the actual number of checks and monitoring scripts is legion, and it will be up to you and your ISP to decide which ones to implement. It is always advisable to use monitoring, it not only means you will have your own alert system when things go wrong, but it will also give you excellent insights into the general development in terms of use and capacity of your server, allowing you to plan ahead much more accurately than without monitoring and statistics collection.

Our advice: talk to your ISP about monitoring options. Some do it for free, some will charge a bit, but being ahead of the competition and having the ability to act proactively is a big advantage for any business, especially in IT, where information is key.