Needle in a haystack, or grep revisited: tre-agrep

Probably everyone who uses a terminal knows the command grep; cf. this excerpt from its man page:

grep searches the named input FILEs (or standard input if no files are named, or if a single hyphen-minus (-) is given as file name) for lines containing a match to the given PATTERN. By default, grep prints the matching lines.

So it is the tool of choice for searching a big file for a specific pattern, or for finding a specific process in the complete list of running processes. It has its limitations, though: it only matches the exact string you search for, and sometimes an “approximate” or “fuzzy” search would be more useful.

For this purpose the program agrep was developed first; Wikipedia gives some details about this software:

agrep (approximate grep) is a proprietary approximate string matching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system. It was later ported to OS/2, DOS, and Windows.

It selects the best-suited algorithm for the current query from a variety of the known fastest (built-in) string searching algorithms, including Manber and Wu’s bitap algorithm based on Levenshtein distances.

agrep is also the search engine in the indexer program GLIMPSE. agrep is free for private and non-commercial use only, and belongs to the University of Arizona.

So it’s closed source, but luckily there is an open source alternative: tre-agrep.

TRE Library

TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.

The matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M^2N), where M is the length of the regular expression and N is the length of the text. The space used is also quadratic in the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only in pathological cases which are probably very rare in practice.

Approximate matching

Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance), where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match. Each insertion, deletion, or substitution adds to the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.

INSTALLATION

tre-agrep is usually not installed by default on any distribution, but it is available in many repositories, so you can easily install it with your distribution’s package manager; e.g. for Debian, Ubuntu, and Mint you can use the command:

apt-get install tre-agrep

BASIC USAGE

The usage is best demonstrated with some simple examples of this powerful command, given the file example.txt that contains:

Résumé
RÉSUMÉ
resume
Resümee
rèsümê
Resume
linuxaria

Following is the output of the command tre-agrep with different options:

mint-desktop tmp # tre-agrep resume example.txt
resume

mint-desktop tmp # tre-agrep -i resume example.txt
resume
Resume

mint-desktop tmp # tre-agrep -1 -i resume example.txt
resume
Resümee
Resume

mint-desktop tmp # tre-agrep -2 -i resume example.txt
Résumé
RÉSUMÉ
resume
Resümee
Resume

As you can see, without any options it returns the same result as a normal grep. The -i option makes the search case-insensitive, and the interesting options are -1 and -2: these set the maximum distance allowed in the search, so the larger the number, the more results you get, since matches may be further from the original pattern.

To see the distance of each match you can use the option -s, which prints each match’s cost:

mint-desktop tmp # tre-agrep -5 -s -i resume example.txt
2:Résumé
2:RÉSUMÉ
0:resume
1:Resümee
3:rèsümê
0:Resume
5:linuxaria

So in this example the string Resume has a cost of 0, while linuxaria has a cost of 5.

Further interesting options are those that assign a cost for different operations:

-D NUM, --delete-cost=NUM – Set cost of missing characters to NUM.
-I NUM, --insert-cost=NUM – Set cost of extra characters to NUM.
-S NUM, --substitute-cost=NUM – Set cost of incorrect characters to NUM. Note that a deletion (a missing character) and an insertion (an extra character) together constitute a substituted character, but the cost will be that of a deletion and an insertion added together.
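For instance, to make substitutions twice as expensive as insertions or deletions (using the example.txt from above):

tre-agrep -2 -s -i --substitute-cost=2 resume example.txt

With the substitution cost doubled, Résumé now costs 4 instead of 2 and should fall outside the -2 threshold; only the exact (case-insensitive) matches remain within it.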

CONCLUSIONS

The command tre-agrep is yet another small tool that can save the day if you work a lot with terminals and shell scripts.

 

This article was originally published on linuxaria. Castlegem has permission to republish. Thank you, linuxaria!

Hard Disks: Bad Block HowTo

Hardware fails, that is a fact. Nowadays, hard drives are rather reliable, but nevertheless every now and then we will see drives failing, or at least having hiccups. Using smartctl/smartd to monitor disks is a good idea. Below we discuss how some lesser issues can be handled without actually having to reboot the system – it is still up to a sysadmin’s own discretion to judge the circumstances correctly and evaluate whether the disk errors encountered are a one-time incident or indicative of an entirely failing disk.

Let’s have a look at a typical smartctl -a DEVICE output:

# smartctl -a /dev/sda

...
ID# ATTRIBUTE_NAME          .... RAW_VALUE
197 Current_Pending_Sector  .... 2
...

OK, so we have an oops here. Time to find out what is going on:

# smartctl --test=short /dev/sda

This will take a very short time, a couple of minutes at most, e.g.:

Please wait 2 minutes for test to complete.
Test will complete after Sat Feb  2 16:25:10 2013

Now, with a current pending sector count > 0 we will most likely have an ouch after the test completes:

Num  ..  Status                  Remaining  ..  LBA_of_first_error
...
# 2  ..  Completed: read failure 90%        ..  1825221261
...

LBA counts sectors in units of 512 bytes and starts at 0, so we now need to find out where 1825221261 is actually located:

# fdisk -lu /dev/sda

will display some information about the device in question:

   Device Boot      Start         End      Blocks   Id  System
...
/dev/sda3        31641600  1953523711   960941056   83  Linux
...

Obviously, 1825221261 is on /dev/sda3. Now we need to determine the file system block for the LBA in question, so we first have to get the block size:

# tune2fs -l /dev/sda3 | grep Block

Block count:              240235264
Block size:               4096
Blocks per group:         32768

OK, 4096 bytes. So, the actual block number will be:

(LBA - PARTITION_START_SECTOR) * (512 / BLOCKSIZE)

In our case, this is:

(1825221261 - 31641600) * (512 / 4096) = 224197457.625

We only need the integer part; the fraction just tells us that we are into the 6th of the eight sectors that make up this file system block.
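The same calculation in the shell, should you want to script it (values from the example above; the integer division drops the fraction for us):

LBA=1825221261; START=31641600; BS=4096
echo $(( (LBA - START) / (BS / 512) ))
  (prints 224197457)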

It is good practice to find out which inode/file has been affected by using debugfs (operations can take a while with this tool):

# debugfs

debugfs:  open /dev/sda3
debugfs:  icheck BLOCK (224197457 in our case)
Block   Inode number
224197457       56025154
debugfs:  ncheck 56025154
Inode   Pathname
56025154        /some/path/to/file

If this file isn’t anything crucial, we can start correcting things now:

# dd if=/dev/zero of=/dev/sda3 bs=4096 count=1 seek=BLOCK
  (BLOCK being 224197457 in our case)
# sync

smartctl -a will now show an updated current pending sector count, and you can re-run a short smartctl test.
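For example, to verify the result on the same device:

# smartctl --test=short /dev/sda
# smartctl -a /dev/sda | grep -i pending
  (Current_Pending_Sector should now be back to 0)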

Source: http://www.vanderzee.org/bad_blocks_howto

 

Migrating Proxmox KVM to Solus / CentOS KVM

By default, Proxmox creates KVM based VMs on a single disk partition, typically in raw or qcow2 format. Solus, however, uses an LVM based system. So how do you move things over from Proxmox to Solus? Here goes:

  1. Shut down the respective Proxmox VM;
  2. As an additional precaution, make a copy of the Proxmox VM (cp will do);
  3. If the Proxmox VM is not in raw format, you need to convert it using qemu-img:
    qemu-img convert PROXMOX_VM_FILE -O raw OUTPUT_FILE
    Proxmox usually stores the image files under /var/lib/vz/images/ID
  4. Create an empty KVM VM on the Solus node with a disk size at least as large as the raw file of the Proxmox VM (and possibly adjust settings such as driver, PAE, etc.), and keep it shut down;
  5. In the config file (usually under /home/kvm/kvmID) of the newly created Solus VM, check the following line:
    <source file='/dev/VG_NAME/kvmID_img'/>
    and make a note;
  6. dd the Proxmox raw image over to the Solus node (a concrete sketch follows the list):
    dd if=PROXMOX_VM.raw | ssh [options] user@solus_node 'dd of=/dev/VG_NAME/kvmID_img'
  7. Boot the new Solus KVM VM.
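As a concrete sketch with hypothetical names throughout (Proxmox VM ID 101, Solus volume group vg0, Solus VM kvm101; adjust everything to your own setup):

qemu-img convert /var/lib/vz/images/101/vm-101-disk-1.qcow2 -O raw /tmp/vm-101.raw
dd if=/tmp/vm-101.raw bs=1M | ssh root@solus-node 'dd of=/dev/vg0/kvm101_img bs=1M'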

 

Current virtualisation statistics

Out of pure interest we have collected a snapshot of the current distribution of virtualisation technologies among our client base; the results are below (they do not account for distortions caused by which virtualisation technologies are available at each location, though):

Virtualisation   Percentage
OpenVZ           25.21%
XEN PV           40.60%
XEN HVM          14.10%
KVM              20.09%

IOPS and RAID considerations

IOPS (input/output operations per second) are still – maybe even more so than ever – the most prominent and important metric to measure storage performance. With SSD technology finding its way into affordable, mainstream server solutions, providers are eager to outdo each other offering ever higher IOPS dedicated servers and virtual private servers.

While SSD based servers will perform vastly better than SATA or SAS based ones, especially for random I/O, the type of storage alone isn’t everything. Vendors will often quote performance figures obtained under lab conditions, i.e. the best possible environment for their own technology. In reality, however, we are facing different conditions – several clients competing for I/O, as well as a wide-ranging mix of random reads and writes along with sequential I/O (imagine 20 VPS doing dd bs=1M count=128 if=/dev/zero of=test conv=fdatasync).

Since most providers won’t offer their servers without RAID storage, let’s have a look at how RAID setups impact IOPS. Read operations will usually not incur any penalty, since they can be served from any disk in the array (the total theoretical read IOPS available is therefore the sum of the individual disks’ read IOPS), whereas the same is not true for write operations, as we can see from the following table:

RAID level   Backend write IOPS per incoming write request
0            1
1            2
5            4
6            6
10           2

We can see that RAID 0 offers the best write IOPS performance – a single incoming write request equates to a single backend write request – but we also know that RAID 0 bears the risk of total array loss if a single disk fails. RAID 1 and 10, the latter being providers’ typical or most advertised choice, offer a decent tradeoff – 2 backend writes per incoming write. RAID 5 and RAID 6, which have to read and update parity on every write, bear the largest penalty.

When calculating the effective IOPS, thus, keep in mind the write penalty individual RAID setups come with.

The effective IOPS performance of your array can be estimated using the following formula:

IOPSeff = n * IOPSdisk / ( R% + W% * FRAID )

with n being the number of disks in the array, R and W being the read and write percentages, and F being the RAID write factor tabled above.

We can also calculate the total IOPS performance needed based on an effective IOPS workload and a given RAID setup:

IOPStotal = ( IOPSeff * R% ) + ( IOPSeff * W% * FRAID )

So if we need 500 effective IOPS, and expect around 25% read, and 75% write operations in a RAID 10 setup, we’d need:

500 * 0.25 + 500 * 0.75 * 2 = 875 total IOPS

i.e. our array would have to support at least 875 total, theoretical IOPS. How many disks/drives does this equate to? Today’s solid state drives will easily handle that, but what about SATA or SAS based RAID arrays? A typical SAS 10k hard disk drive will give you around 100-140 IOPS, so at roughly 110 IOPS per drive we would need 8 SAS 10k drives to achieve the desired performance.
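The same arithmetic as a quick shell check (a minimal sketch using bc for the decimal math):

echo "500 * 0.25 + 500 * 0.75 * 2" | bc
  (prints 875.00)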

Conclusion:
All RAID levels except RAID 0 have significant impact on your storage array’s IOPS performance. The decision about which RAID level to use is therefore not only a question about redundancy or data protection, but also about resulting performance for your application’s needs:

  1. Evaluate your application’s performance requirements;
  2. Evaluate your application’s redundancy needs;
  3. Decide which RAID setup to use;
  4. Calculate the resulting total IOPS required.

 

Sources:

Calculate IOPS in a storage array by Scott Lowe, TechRepublic, 2/2010
Getting the hang of IOPS by Symantec, 6/2012


Increasing the size of a XEN-PV disk

Below is a quick solution to increase the size of a XEN-PV disk that works fine on all our nodes (paths and names may vary depending on your control panel setup; here Solus is assumed):

  1. shut down the VM
  2. lvextend -L +[INTEGER]G /dev/VOLUMEGROUP/vmID_img
  3. e2fsck -f /dev/VOLUMEGROUP/vmID_img
  4. resize2fs /dev/VOLUMEGROUP/vmID_img
  5. boot the VM
This should work for any standard xen-pv template (we use the ones from Stacklet); a concrete example follows.
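A concrete run with hypothetical names (volume group vg0, VM vm101, growing the disk by 10 GB):

lvextend -L +10G /dev/vg0/vm101_img
e2fsck -f /dev/vg0/vm101_img
resize2fs /dev/vg0/vm101_img

Shut the VM down first and boot it again afterwards, as per the list above.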

 

Adding disks to Windows VMs under KVM

Reading through various posts on forums and blogs all over the web, you will find many solutions for adding another disk to a Windows VM running under KVM. Below is one that worked smoothly on all our nodes running the Solus control panel, with KVM as the virtualisation technology:

  1. create a new volume with
    lvcreate -L [INTEGERSIZE]G -n [NEW_VOL_NAME] [VOLUMEGROUPNAME]
  2. edit the vm’s config file (under Solus, this is usually /home/kvm/kvmID/kvmID.xml), and add a section below the first disk (assuming hda has already been assigned, we use hdb here for the new disk; a filled-in sketch follows the list):
        <disk type='file' device='disk'>
         <source file='/dev/VOLUMEGROUPNAME/NEW_VOL_NAME'/>
         <target dev='hdb' bus='ide'/>
        </disk>
  3. shut down and then boot the vm
  4. log in, and in the storage section of your server administration tool, initialise and format the new disk
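A filled-in sketch with hypothetical values (VM ID 101, volume group vg0, a 20 GB second disk):

lvcreate -L 20G -n kvm101_disk2 vg0

and the corresponding section in /home/kvm/kvm101/kvm101.xml:

    <disk type='file' device='disk'>
     <source file='/dev/vg0/kvm101_disk2'/>
     <target dev='hdb' bus='ide'/>
    </disk>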

NB for Solus: you will have to create a hook and enable advanced config in the control panel, otherwise Solus will overwrite the edited config again. The most basic hook would just keep the production config in a separate file in the same directory and ensure that the new file is used, e.g. from ./hooks/hook_config.sh (must be executable):

#!/bin/sh
# move the Solus-generated config aside and put our edited config in place
mv /home/kvm/kvmID/kvmID.xml /home/kvm/kvmID/kvmID.xml.dist
cp -f /home/kvm/kvmID/kvmID.xml.newdisk /home/kvm/kvmID/kvmID.xml


Updating CentOS (RHEL, Fedora)

This is just a very concise summary to guide you through the typical update process of a CentOS based Linux server that has no control panel installed on top of it (a condensed command sequence follows the list). This post will also appear in our dedicated server hosting BLOG:

  1. run yum check-update from the shell.
    This will give you a list of newly available packages for your distribution based on the repositories you have defined. This list will typically not be too long for a well maintained server, unless the distribution itself has just undergone a major update (such as from CentOS 5.7 to 5.8 recently).
  2. check the packages listed and ensure that your currently running applications will still be compatible with the new versions of any packages updated.
  3. make backups of any individual settings you have made for any packages that are going to be updated (httpd.conf, php.ini, etc.). Usually, these will not be touched, but it doesn’t hurt to make sure you have a copy (in addition to the regular backups you should be doing!).
  4. once you have confirmed that everything should still be fine after the update, run yum update from the shell.
    This will start the update process, and you will have to confirm the transaction before it is actually processed (last chance to say “no”!).
  5. once complete, restart affected services (such as httpd, for example), or reboot your server if vital system packages have been updated (kernel, libc, …).
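In condensed form (httpd is only an example of an affected service; back up whichever configs you have customised):

yum check-update
cp -a /etc/httpd/conf/httpd.conf /root/httpd.conf.bak
yum update
service httpd restart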

Managed or not?

We have had a similar post back in July 2011 (cf. here), so why are we bringing this up again? Recently, we have had a large surge in two categories of orders: unmanaged low-end VPS (256MB memory and the like, for use as DNS servers, etc.), and fully managed servers.

Customers are increasingly aware of the need to back up their sites with a well managed server. Typically, the managed option only extends to managing the operating system (and possibly hardware) of the server in question, i.e. keeping the operating system updated with the latest security patches (something that an “intelligent” control panel, such as cPanel, can mostly handle itself), installing the latest package upgrades, and generally making sure the server works as intended.

In most cases, however, managed does not cover application issues. This is a crucial point: you as the customer need to be sure that the server administration side of your enterprise speaks the same language as the application development side. Nothing is worse than an eager sysadmin updating a software package without consulting the developers who, incidentally, depend on the older version for the entire site to run smoothly. With today’s globalisation this can cause you additional grief – often your developers are from a different company than your ISP, and each side will naturally try to deflect the blame. That can leave you and your enterprise crippled or hindered.

What do we advise?

  1. Don’t save money on a sysadmin.
  2. Make sure your sysadmin talks to your developers and understands what they need.
  3. Make sure your sysadmin has a basic understanding of your application in case of emergencies.
  4. Make sure your staff (sysadmins and developers) coordinate updates and upgrades.
  5. Make sure you have a working test environment where you can run updates and upgrades in a sandbox to see whether things still work as expected afterwards.
  6. Have a team leader coordinate your sysadmin(s) and developer(s), or take this role upon yourself.

How much is it going to cost you?

Fully managed packages vary in cost. The usual sysadmin packages that deal with the operating system only will add anything between £20 and £200 per month to your budget; if you want the sysadmin to be an integral part of your team and support your application as well (in terms of coordinated server management), the price will be towards the higher end of that range, but might then also include some support for the application itself.

Who to hire?

Get someone with experience. There are sysadmins out there who have decades of experience and know the dos and don’ts, and there are sysadmins who consider themselves divine just because they have been “into linux for 2 years”. A sysadmin is not someone who jumps at the first sight of an available package upgrade and yum installs 200 dependencies to claim the system is up to date. A sysadmin is someone who understands the implications of a) upgrading and b) not upgrading, who will weigh these pros and cons and explain them to you before making suggestions as to what to do. A sysadmin is someone you trust to take this decision off your shoulders so you can run your business instead of worrying whether the next admin cowboy is going to blow up your server. A sysadmin is someone who knows not only how to keep a system alive, but also how to bring a failed system back to life.

These are just some general guidelines, contact us for further advice, we are happy to help!